Detailed cross comparison of building energy simulation tools results using a reference office building as a case study

Building Energy Simulation (BES) tools play a key role in the optimization of the building system during the different phases, from pre-design through commissioning to operation. BES tools are increasingly used in research as well as in companies. New BES tools and updated versions are continuously being released. Each tool follows an independent validation process but rarely all the tools are compared against each other using a common case study. In this work, the modelling approaches of widespread dynamic simulation tools (i.e. EnergyPlus, TRNSYS, Simulink libraries CarnotUIBK and ALMABuild, IDA ICE, Modelica/Dymola and DALEC), as well as PHPP (a well-known quasi-steady-state tool), are described and the results of all the tools modelling the same characteristic ofﬁce cell, deﬁned within the IEA SHC Task 56, are compared on a monthly and hourly basis for the climates of Stockholm, Stuttgart and Rome. Unfortunately, different tools require different levels of input detail, which are often not matching with available data, hence the parametrization process highly inﬂuences the quality of the simulation results. In the current study to evaluate the deviation between the tools, frequently used statistical indices and normalization methods are analysed and the problems related to their application, in a cross-comparison of different tools, are investigated. In this regard, the deviation thresholds indicated by ASHRAE Guideline 14-2014 are used as a basis to identify results that suggest an acceptable level of disagreement between the predictions of a particular model and the outcomes of all models. The process of reaching a good agreement between all tools required several iterations and great effort on behalf of the modellers. To aid the deﬁnition of building component descriptions and future references for inter-model comparison a short history of the executed steps is presented in this work. Together with the comparison of the results of the tools, their computational cost is evaluated and an overview of the modelling approaches supported by the different tools for this case study is provided aiming to support the users in choosing a ﬁt-for-purpose simulation tool.


Introduction
A large number of Building Energy Simulation (BES) tools, with different focus and level of detail, have been developed in the last six decades [1,2].A comprehensive list of BES is provided in [3] and [4].BES tools are used in research and increasingly in building design, construction, commissioning and operation for accelerating and improving the design and planning process, optimizing building performance, developing building controls, testing new products and evaluating the market potential of novel concepts.
Moreover, ever-stricter building codes and energy standards have stimulated the usage of BES [5].

Model complexity: approximation versus estimation
The two main aspects under which the different tools can be classified are first, the complexity of the mathematical models depending on the purpose and focus of the tool, and second, the possibility to access the BES source code (which is particularly relevant in the research field).When the source code is accessible and can be easily modified, the user can tailor the equations to https://doi.org/10.1016/j.enbuild.2021.1112600378-7788/Ó 2021 Elsevier B.V. All rights reserved.

Contents lists available at ScienceDirect
Energy & Buildings perfectly suit the specific case study increasing the degree of detail 1 .Unfortunately, this will also lead to an increase in the total simulation time, modelling effort and skill that is required.Dynamic simulation tools such as TRNSYS, Modelica/Dymola, IDA ICE, Energy-Plus, and Simulink, focus on the transient behaviour of systems and allow detailed analysis.However, their application requires a high level of user expertise and expert knowledge of many input parameters.Moreover, the modelling and simulation process can be timeconsuming in terms of computational cost [5] and modeller effort.In addition, the suitability of a particular level of model complexity, or a particular modelling tool, can only be assessed in relation to the objectives and constraints of the simulation study.In the present study, in order to develop a meaningful conclusion to a specific problem, both the degree of detail of the model as well as the accuracy of the available input data are the defining aspects.As a result, the potential overall error in performance predictions is a function of both the degree of approximation of the physical phenomena involved as well as the degree of estimation of uncertain input parameters.Therefore, the process of minimization of the overall 1 This is possible only in the case of white and grey box models and not in blackbox models [81].White box models require detailed knowledge of the physical process while black box models do not require full knowledge of the system and are developed using a data-driven approach.Grey box model preserve the physical description of the system, but their parameters are estimated using system identification methods [81].
error involves balancing the complexity of the model with the information, time and other resources available to the simulation study [6].
In general, the process of model parametrisation is error-prone since it is easy to make unrepresentative assumptions or just commit user errors in preparing building input files.Fortunately, the number of inputs to be parametrized can be reduced by using a tool with the degree of detail required by the modelling phase.Typing errors might be avoided by increasing the usage of automated procedures.In this regard, Building Information Modelling (BIM) might have the capability to lead to a more automated process and thus reduce user mistakes.Nevertheless, further development in terms of availability and agreement in information transfer is required to apply BIM to BES in practice (see [7;8]).
In order to assess the extent to which different simulation tools fit the need of a particular simulation study, modellers require an overview of the complexity or degree of abstraction of the component models that are applied in the different tools as well as the input parameters, skill effort and time that are required to use them.In this context, Crawley et al. [3] comprehensively and extensively reported the modelling features for a large number of available tools, which helps the BES user to select the right tool according to the purpose.However, this kind of BES tools comparison needs to be updated since new component models are constantly being developed.In addition, new versions and tools are available such as Matlab/Simulink toolboxes (e.g.CARNOT toolbox [9], CarnotUIBK [10] and ALMABuild [11]), Modelica libraries [12], DALEC [13], PHPP [14].Unfortunately, using a tool with a high degree of detail might not only be time-consuming to create the simulation model, but also for the computation itself.Certainly, some particular aspects can be analysed only if the model describes them.For example, a detailed comfort analysis can only be done if the models calculate the temperature distribution in the room [15].The aspect of the computational cost has been qualitatively mentioned in [16][17][18][19].Quantitative comparisons in terms of the computational time of different tools have been proposed in [20] including DALEC, Radiance, Relux and EnergyPlus, in [21] comparing eQUEST against EnergyPlus, in [22] including Energy-Plus and DOE2 and in [23] where TRNSYS and Modelica have been compared.Yet, this is an important topic especially in the expanding field of simulation-based optimization [24] and studies presenting updated comparison of the computational cost including a wide range of BES tools are needed.
The results of such a comprehensive analysis that also show the impact on the computational time, would be a helpful instrument for supporting the user in the decision of which tool to be used for a certain study.

Validation and trust in BES
According to Feist [25], the five major reasons for the deviations between different simulation models are: (1) different algorithms, (2) numerical errors (errors in the calculation), (3) programming errors (errors in implementation), (4) non-identical inputs, and (5) different processing of the weather data that is used by the BES.Additionally, the results are influenced by different physical approaches with different levels of detail and different numerical schemes.As an example, the thermal zone can be modelled with a 2-star or star-node approach and the wall model with the finite difference method or transfer function.To gain trust in the building simulation, it is important to validate the models.Therefore, many validation studies have been published over the years and the level of attention on this subject has increased significantly in the recent years.According to Judkoff et al. [26] and [27], the results of BES tools can be validated against measurements (empirical validation), analytical solution (analytical validation), or against other codes (cross-validation).In this context, an analytical solution is only available for simple configurations (e.g.wall model and verification of system simulations [7,8]).Empirical validations involving different tools are presented in [16,28], and [29]; this kind of validation has to deal with the measurement uncertainties and is usually limited to a specific amount of data for a limited period of time.Whereas cross-validation allows to minimize the input uncertainties and allows analysis of many different configurations with a high level of detail [5,30,31,32].Nevertheless, the main drawback of the cross-validation approach is that it is difficult to precisely define the set of reference results.
The BESTEST methodology, established by the ANSI/ASHRAE Standard 140-2017 [33] and later updated by [34], EN 15255 [35] and EN 15265 [36] and German guidelines such as VDI 6020 [37], VDI 2078 [38] and VDI 6007 [39], describe different test cases against which the implemented algorithms or models have to be tested to correct code errors, modelling limitations or input errors eventually present in the tool [40].Comparative values are given for the evaluation of the calculated results.However, only specific parameters are analysed within each BESTEST case.The spread between the minimum and maximum thresholds, suggested by BESTEST for the tool validation, can be as large as 47% [34] in some test cases.
Moreover, in both empirical and cross-validations, when only heating and cooling demands are considered and all the other contributions are discarded of the thermal balance (i.e., thermal losses, solar gains, internal gains, ventilation and infiltration losses and temperatures of the surfaces and air), it is difficult to interpret the results since all possible error sources are acting simultaneously [40].Thus, conclusions about the model accuracy have to be derived, preventing as much as possible the offsetting errors by considering as many as possible outputs and different time scales.This is difficult to achieve with empirical validation since the available measurements refer to a limited time frame and do not cover the majority of the outputs of a simulation tool.For these reasons in the current work a detailed cross-comparison, focused on a solar-driven, reference office test cell that includes shading and ventilation control logic, is carried out that considers all the components of the energy balance on a monthly and hourly basis for one year of simulation.

Statistical indices for the quantification of the deviations among the results of the different tools
The definition of accuracy indices is required to quantify the deviations between time series and therefore to assess the goodness of the results of a model against other simulations or measured data.As highlighted in [27], there is the need to find appropriate system performance indices.In this field, Uniform Methods Project [41] and ASHRAE Guideline 14-2014 [42] are widely recognized guidelines aiming at establishing a method for measuring the accuracy of building models as well as the Federal Energy Management Program (FEMP) [43] and the International Performance Measurement and Verification Protocol (IPMVP) [44], which refer to ASHRAE Guideline 14-2014 [42].These documents suggest thresholds, which are different for monthly or hourly calibration, for the Normalized Mean Bias Error (NMBE) and Normalized Root Mean Square Error (NRMSE).In addition, it has been suggested to analyze a time frame of at least one year and to compare the simulation results with the utility bills and/ or spot measurements.However, as highlighted in [45] the indoor condition and temperatures are not addressed in the suggested calibration approaches.Numerous papers using calibration methods are available in the literature, as reported in [45] nevertheless, varying approaches have been applied by different researchers in the previous works due to lack of a standard procedure.These approaches use different indicators, applied to different variables with varying resolution and time frames to quantify the model accuracy.As an example [16] and [29] calculate the statistical indices (i.e.Coefficient of determination (R 2 ), Root Mean Square Error (RMSE) and NRMSE are used in [16]; Mean Error (ME), RMSE, standard deviation and maximum error are used in [29]) based on temperatures, while [46] bases its analysis on the energy consumption of the building calculating the Mean Bias Error (MBE) and the NRMSE for each month and the whole year using hourly data.In general, it is not uncommon to find literature referring to the same index with different names or definitions as highlighted in [47].In the current work, the main statistical indices and normalization means that are used are critically analysed to help shed some light on the difficulties that are encountered so a new normalization method can be proposed.
As reported in [4] a long list of BES tools exist, which includes tools from different countries.However, in the current work, Ener-gyPlus [48], TRNSYS [49], Matlab/Simulink (ALMABuild [11] and CarnotUIBK [10]), the Modelica/Dymola buildings library [50], and IDA ICE [51] are analyzed along with DALEC [20] and PHPP [14], which can be classified as predesign tools.Almost all these tools have already undergone a validation process following the BESTEST method (i.e.EnergyPlus in [52], ALMABuild [11], Modelica [53], IDA [54], TRNSYS [34]), and all the tools except the Simulink libraries ALMABuild and CarnotUIBK have been compared against measured data in [16,28,29,20] and [55].However, they have not been compared against each other using a solar-driven building as a common test case.Finally, only a few recent studies have presented a quantitative and comprehensive comparison in terms of computational time including a wide variety of dynamic building simulation tools.
With each tool, the reference office cell was modelled starting from the same description (specially tailored on the TRNSYS input).Even though almost all the tools are successfully validated using the BESTEST methodology (i.e.EnergyPlus in [52], ALMABuild [11], Modelica [53], IDA [54], TRNSYS [34]), high deviations between the results of the analysed tools were experienced during the first itarations of the cross-comparison process.Indeed, to reach a good agreement between the tools, several iterations were necessary due mainly to a lack of equivalent inputs in the building description, user mistakes and different modelling approaches of the tools under analysis.

Methodology
In this section, the building model used for this case study is described, and the modelling features of the different tools are introduced and the parametrization process is explained.Moreover, the statistical indices are described and the boundary conditions, applied for the evaluation of the computational cost, are reported.

Boundary conditions
Three different locations Rome (ROM), Stuttgart (STU) and Stockholm (STO), characterized by different European climates (Mediterranean, oceanic climate and humid continental climate respectively, according to the Köppen climate classification [56]) were considered in this study.Table 2.1 shows the annual average ambient temperature (# À amb,av ), annual global irradiation on a horizontal surface (I g,hor ) and annual irradiation on a south-oriented vertical surface (I south ) for each location.
All the tools use the same Typical Meteorological Year (TMY2) for each location.Therefore, the ambient temperature and global irradiation on the horizontal were the same (see [57] for more details).Nevertheless, it is important to highlight that DALEC calculates the I south using a diffuse isotropic sky model 2 described in [58] whereas all the other tools implement the Perez model (i.e.EnergyPlus, Simulink, IDA ICE and Modelica used the Perez 1990 model [59] while TRNSYS used the Perez 1999 model (see [60] page 7-99).Therefore, Simulink, EnergyPlus, TRNSYS, Modelica and IDA ICE present negligible deviations in terms of I south , while DALEC underestimates I south in all the climates (see Figure 2.5).To perform a fair comparison, the solar gains of DALEC have been aligned with the solar gains of the other tools by calibrating the shading control thresholds (see Section 2.4).

Building description
The reference building was chosen to be representative of a typical European office cell located on the middle floor of a high-rise building.Figure 2.1 shows the office cell, which has a heated area of 27 m 2 and a volume of 81 m 3 .All the surfaces are considered adiabatic, except for the façade oriented towards the South (with a window-to-wall ratio of 60%) where ambient boundary conditions are applied and solar active technologies such as daylighting systems can be installed 3 .Shadings from adjacent obstacles were not considered, whereas external movable shading, able to block 70% of the incoming radiation, was activated when direct solar radiation impinging the south façade was higher than 120 W/m 2 .
Table 2.2 reports the heat transfer coefficient (HTC) of the opaque wall element and the characteristics of the windows such as HTC, Solar Heat Gain Coefficient (SHGC) and the solar transmittance (s sol ) for the three climates.The HTC of the window can be calculated as the area-weighted HTC of the frame and the glass.The internal walls were typical plasterboard walls, while the exterior wall consists of a three-layer structure with different insulation thicknesses depending on the climate.
User behaviour (e.g., occupancy, appliances and lighting) was taken into account by employing hourly resolution profiles that show different user behaviour for weekday and weekends [61] (see Figure 2.

2).
The natural infiltration rate was assumed to be constant and the air change per hour was set to 0.15 ACH.A fresh air supply of 40 m 3 /h per person was provided by a mechanical ventilation system with a sensible heat recovery of 70% efficiency.A bypass of the heat recovery was activated when the temperature of the zone was higher than 23 °C and the ambient temperature was lower than the indoor temperature (see Figure 2.

3).
A simplified ideal all-air heating and cooling system was used by all the models.The set-point for the indoor convective temperature applied for the heating and cooling control was 21 °C and 25 °C, respectively.When the convective temperature was between 21 °C and 25 °C neither the cooling system nor the heating system is activated.A summary of the building data and boundary conditions is reported in the appendix in Table C1 and further information is reported in [57].

Modelling features of the tools
The tools analysed in this work had different modelling features as follows:  [10] and the University of Bologna (IT) [11] respectively; IDA ICE v.4.8 (IDA) is a commercial software, developed by EQUA Simulation AB.It is focused on detailed and dynamic multi-zone simulation applications for the study of thermal indoor climate as well as the energy consumption of the entire building involving envelope, HVAC systems, plant and control strategies [63].The model inputs are described using the equation-based Neutral Model Format language [64]; DALEC (DAL) is a free web tool developed by Bartenbach (AT), University of Innsbruck (AT) and Zumtobel (AT).The main focus of DALEC is on combined thermal and lighting building simulations in early design phases [13]; MODELICA (MOD) is a non-proprietary, object-oriented, equation-based language for modelling complex physical systems developed by Modelica Association.Currently, several Modelica libraries exist for building components and HVAC systems, and these are continuously being upgraded.In this work, the Buildings library v.5.0.1 was used together with the Modelica standard library [50].The Dymola modeling and simulation environment (v.2020x) was also used; PHPP v.9.1 Passive House Planning Package is a commercial quasi-steady-state calculation tool, developed as a spreadsheet by the Passive House Institute, for use by architects, engineers and planning experts [14].
The different tools implement different models with different levels of detail to approach the numerical solution of the building system using different equations.An extensive description of the equations implemented in TRN, IDA and EP is provided in [16], a description of the mathematical model used in DAL is provided in [20,58] and of MOD in [12].SIM_BO and SIM_IBK do not provide user manuals, while CARNOT toolbox provides only an introduction to the library in [65].
Table 2.3 provides a summary of the approaches applied in this work (in black) and available (in grey) in each tool for different aspects of the building model.The model of the thermal zone is used to calculate the convective and radiative temperature.The most detailed approach would be a CFD simulation considering a distribution of the convective as well as the radiative temperature within the room.Such a model requires a high computational cost and therefore, it is not usually applied in annual energy simulation [66]; however MOD and IDA offer the possibility of using a simplified CFD approach 4 (see [67;68], respectively).EP, TRN, SIM_BO, IDA and MOD, for convex and closed volume, can model the air in the room as one unique node while calculating the radiative exchange between the surfaces using view factors.EP, TRN, SIM_BO and IDA ICE can additionally calculate the view factors between the internal surfaces and a matrix of points in the room, whose location is defined by the user, allowing the calculation of the mean radiative temperature field in the room.The two-star node approaches [69] (implemented in SIM_IBK, SIM_BO) include a convective node and a radiative node (the long-wave radiative exchange between the surfaces is modelled using the star network).In the simplified calculation mode, TRN implements a star network (see [70]  where an artificial temperature node is used to consider the parallel energy flow from the inside wall surface to the zone air by convection and the long-wave radiation exchange between the surfaces.DAL is based on the Standard ISO 13790:2008 [71] where the room thermal balance is solved considering three nodes and both the air and mean radiant temperatures are calculated [20].In DAL the nodes are connected through a specific coupling conductance defined in the Standard ISO 13790:2008 [71].The total thermal capacity of the walls and air volume is connected to the node representing the mean radiant temperature.PHPP is a quasi-steady-state tool based on Standard ISO 13790:2008 [71] that calculates losses and gains on monthly basis considering a fixed set point temperature (different for winter and summer energy balances).
The radiative exchange coefficient can be: Constant when an overall HTC is considered (Table 2.4: R1); Linearized (Table 2.4: R2); Proportional to the temperature difference at the fourth power (Table 2.4: R3); and Calculated considering the view factors between the surfaces (Table 2.4: R4).
Different window models are implemented in the analysed tools, in particular, EP, TRN, SIM_BO, IDA and MOD perform a thermal balance over each pane of the window (see Figure 2.4a), while DAL, SIM_IBK and PHPP are based on a simplified window thermal model where the transmission losses are calculated by using a constant heat transfer coefficient (see Figure 2.4b).
EP, TRN, SIM_BO, IDA, MOD consider the solar transmittance and the distribution of absorbed solar radiation across the multilayer glazing system as a function of the angle of incidence.EP, IDA, MOD explicitly describe this behaviour by recursively solving the transmittance, reflectance and absorptance of solar radiation through each layer, while TRN and SIM_BO by using preprocessed data of multi-layer window system from the LBNL-Window software [72] (see [73] for the WINDOW technical documentation).A description of the implementation of the detailed window modelling approach in EP can be found in [74] and [75].EP, MOD and IDA ease the investigation of particular fenestration systems (e.g.including glass coatings) and require more detailed input.SIM_IBK uses a standard correction function (different for single, double and triple panes windows) of the SHGC at normal incidence, to different incidence angles.DAL discretizes the sky into 145 patches and, for every patch applies a pre-calculated correction factor to the SHGC at normal incidence given by the user (see [20]).
EP and TRN model the opaque structures with the transfer function method, whereas both Simulink libraries, IDA and MOD imple-ment the finite difference method.For more details about these different approaches see: [69] where the theory behind these methods is explained and [76], which provides a detailed comparison between transfer function and finite difference methods.DAL and PHPP implement a simplified model of the walls, based on the HTC of the structures.PHPP is a quasi-steady-state calculation tool where the capacity of the building is used only to calculate the utilization factor of the internal and solar gains as described in the Standard ISO 13790:2008 [71].In DAL the mass of the building is lumped in one capacity as described in the Standard ISO 13790:2008 [71].
The model used for the adiabatic structures influences the active capacity of the building.Both Simulink libraries and IDA apply the same boundary conditions (BC) to both sides of the structure, TRN additionally applies the same thermal resistance between the surface and the air and radiative nodes on both sides.EP implements a null thermal flux on the external side of the adiabatic structure and MOD in the middle of the structure.PHPP and DAL consider the thermal mass in one unique node belonging to the Thermal Zone (TZ).
The isotropic sky model, used in DAL [58], assumes that the diffuse radiation from the skydome is uniform across the sky and predicts lower solar radiation availability on the south façade (see Figure 2.5), compared to the other tools implementing the Perez sky model (see Section 2.1).
The internal gain can be defined as an hourly profile in almost all tools except for PHPP where the internal gain is considered as a constant and in DAL where the internal gain can be defined with a constant value throughout the day, as shown in Figure 2.6.
The control status of the shading and ventilation bypass is defined in every time step for all tools except for PHPP where constant shading and ventilation rate are applied for the summer and winter balance calculation (see Section 2.4, Table 2.7 and Table 2.8).To clarify this aspect, the control status of the shading simulated with SIM_BO for the climate of Stockholm together with the sunset and sunrise time for each day of the year is reported in    2.8).Therefore, the graph reported in Figure 2.7 would be completely black for the PHPP and the value represented by the black colour would be the constant shading coefficient given as input to the PHPP.The same applies to the control of the bypass of the ventilation heat recovery.
Concerning the solver settings, the calculation time step can be defined as constant or can be variable in Simulink, MOD and IDA while it is fixed with the time length defined by the user in TRN and EP.Whereas DAL and PHPP are based on a fixed-time interval of one hour and one month, respectively.
Subportions of the models of the two Simulink libraries could be quite easily exchanged even when this is not foreseen from the structure of the tool itself.The SIM_BO library contains a detailed window model, which is missing in the original SIM_IBK library that has only a simplified model with constant thermal resistance.Therefore, in the current work the complex window model developed in SIM_BO (see Table 2

Parametrization process
Unfortunately, the building description, written using the input required by TRN [57], is either too detailed (this is the case for DAL and PHPP) or too simplified (for EP, IDA, MOD) and therefore equivalent inputs had to be found.The list of inputs that had to be adapted is reported in Table 2.5.Since SIM_BO uses the same input as TRN both tools do not need any parametrization.
As reported in [77] the window properties and heat transfer model have a substantial influence on the results, which is particularly evident in this case study.The model of the C1 Different methods are applied for the detailed radiative model within the different tools.TRN implements the Gebhart matrix, EP the ScriptF method similarly, in IDA a longwave absorption matrix is used to calculate the net absorbed longwave radiation (these approaches are reported in [16]) and MOD [83] and SIM_BO [15] are using the radiosity approach.In this case study the ideal loads air system model is used in EP.In this model heating and cooling energy are convectively supplied in sufficient quantity to meet zone loads.This model offers default controls for heat recovery and outdoor air delivery that differ from the controls described for this case study (see Section 2.1).EMS scripts were therefore used in EP to control heat recovery bypass and heat gains from ventilation frost protection.More detailed HVAC models could have been selected in EP.However, these would have required further estimation of unknown input parameters.fenestration system in EP and MOD requires optical and thermal properties to be assigned for each glazing pane.Yet, the description of the input, defined for TRN [57], only describes the optical behaviour of the overall glazing system as a function of the angle of incidence.To obtain the detailed glazing properties for EP and MOD, the solar transmittance and reflectance of the glazing panes were varied and the results were compared with SIM_BO, which uses the same input as TRN.In EP this was done using an exhaustive search approach and in MOD a trial and error process was followed.The detailed glazing properties were then selected that gave the smallest difference ( 0.1 kWh/m 2 ) in terms of the predicted annual sums of the solar energy that is transmitted by the glazing system, and the energy that is absorbed at each of the two panes.The resulting inputs that were found for the window models in MOD and EP are the same for ROM while small differences for the window properties in STO and STU are present, as reported in the appendix (see Figure A1).Table 2.6 reports the HTC of the glass and façade, defined in a report from the current study [57], derived from SIM_BO and the one used in IDA, DAL and PHPP.SIM_BO dynamically calculates the HTC of the window since it depends on the boundary conditions, therefore, the constant HTC reported in Table 2.6, for SIM_BO, is calculated by finding a constant HTC of the glass, which delivers the same annual thermal losses calculated with the dynamic HTC.The discrepancy between the HTC calculated by Simulink and the one indicated in the report [57], which is especially high in Stockholm, is caused by the fact that the report gives this input considering fixed boundary conditions (i.e., 0 amb = 26.7 °C, no solar irradiation, Wind speed = 6.71 m/s, Internal temperature = 21 °C).
In IDA, both complex and simplified window models could be used.However, the complex window model has to be used with a complex shading model involved in the thermal balance of the window while the simplified window model can be used with a simplified shading model that does not affect the thermal balance of the window.The shading device is described just as a reduction factor of the incoming solar radiation impinging the window in this case study [57].Therefore, all the tools except for EP (where this was not easily modified), do not consider the thermal effect of the shading layer on the thermal balance of the window.Hence, in IDA the simplified approach was used too due to this strong simplification of the fixed shading factor.IDA uses a simplified window model with two panes that use a constant thermal resistance (the fenestration system is always simplified as twopane glazing).The value of this constant has been derived from the annual dynamic simulation of SIM_BO.
In IDA it is not possible to increase the capacity of the air node (as suggested in [57]) therefore the thermal mass was artificially increased by adding a wall (i.e., area: 20 m 2 ; thickness: 0.05 m; density: 900 kg/m 3 ; specific heat capacity: 987 J/kgK) with high thermal conductivity (i.e.10000 W/mK) and heat transfer coefficient (i.e., 20000 W/m 2 K).Moreover, IDA controls the mass flow rate of the supply and exhaust air while in [50] the volume flow rate is given and specified as constant.Therefore, the efficiency of the heat recovery system was adjusted (+4% compared to [50]) to compensate for the slightly unbalanced volume flows of the supply and exhaust air.
DAL is an open-access online tool where the code cannot be modified by the user.To adapt the DAL calculation to the building description, an unpublished Matlab version of the code was used.The results of the original web DALEC and the calibration process are reported in Section 3.4.Concerning the online version the following parts have been modified: The lumped capacity of the model was calibrated by comparing the hourly heating and cooling demand with SIM_BO; Ventilation rate: in the original calculation the ventilation system is always active, while in the calibrated version it is active only during the presence of people (as described in [57]); Shading model: in the original version, the shading is modelled considering optical properties dependent on the solar incident angle while in the calibrated version a constant reduction of the incoming solar gain was applied.Given the fact that DAL applies an isotropic sky model, resulting in lower solar irradiation toward the south façade, the shading threshold is modified to make the solar gain comparable; and The overall HTC of the window and wall is selected to minimize the difference in transmission losses between DAL and SIM_BO.
PHPP and DAL base their calculation on a constant HTC for the wall and window therefore, especially for Stockholm, it was necessary to parametrize the (Table 2.6).
Table 2.7 documents the effective volume flow (considering the 70% efficiency of the heat recovery) elaborated by the ventilation unit during the occupied time (from 8 until 18 see Figure 2.2) and the increased volume flow when the bypass is activated.The values used in DAL are found by minimizing the deviation in terms of cooling demand and ventilation losses with SIM_BO.
PHPP calculation is based on monthly balances therefore only one constant ventilation value for the summer and one for the   winter can be used, disregarding the occupancy schedule.In practice, describing the ventilation with detailed characteristics (e.g.mechanical ventilation with heat recovery of this case study) might be not possible because it is extremely complicated to guess the corresponding ventilation rates without the results of the dynamic simulation as a reference.On the contrary, if the mechanical ventilation system is not yet defined it might be easier to model it with PHPP instead of tools requiring detailed inputs.Table 2.8 reports the ratio between the solar gain and the solar radiation impinging the south glass area as a result of the SIM_BO simulation and the input used in PHPP.The SHGC of the window, as well as the shading and dirt factor, are included in this ratio.The shading coefficient in PHPP is during the winter period 0.60, 0.50, 0.48 in Rome, Stuttgart and Stockholm, respectively and during the summer period 0.58, 0.38, 0.45 in Rome, Stuttgart and Stockholm, respectively.

Statistical indices
In this work, statistical indices were used to quantify the deviations between the results of the different tools.This evaluation was conducted using monthly and hourly datasets for all the components of the energy balance (heating, cooling, ventilation plus infiltration, transmission losses and solar gain) as well as for the convective temperature.Since no absolute reference such as measurements is considered within this work, it is important to define a benchmark against which the results can be compared and the median value of all tools will be used for each parameter.
ASHRAE Guideline 14-2014 [42], FEMP [43] and IPMVP [44], describe a method for the validation of the building model against measurement and suggest limits for the NMBE shown in Eqs.2-14, NRMSE (using the average as a normalization means) in Eqs.2-18 and the R 2 in Eqs.2-19 to verify the accuracy of the models.The calibration criteria given by these standards are: ±5% for the monthly NMBE, 15% for the monthly NRMSE, ±10% for the hourly NMBE, 30% for the hourly NRMSE and > 0.75 for the R 2 ([47], Table 1).
In Eqs.(2-13) to (2-19) r i represents the reference value for the i th time step, calculated as the median of the results of all the tools in each considered time step (i.e.hourly or monthly), while s i is the simulated value for a particular tool at the i th time step, N is the number of considered data, r is the average of the reference values (i.e., median of all the tools for each time step) and nm is a normalization means.All the indices reported in Table 2.9 are calculated for each tool using hourly or monthly data.The MBE and its normalized value NMBE are good indicators of the overall bias but suffer from the cancellation effect, therefore at least one additional index is needed.ASHRAE Guideline 14-2014, FEMP and IPMVP suggest using the NRMSE based on the average value as a normalization means.The Mean Absolute Error (MAE) also provides analogous information as the RMSE, but is easier to interpret since each deviation influences linearly the MAE while the RMSE is more sensible to the outliners.
The RMSE is scale-dependent [78] therefore can be used as a measure of accuracy for a specific dataset but not between different datasets.The normalization of the RMSE eases the comparison between datasets with different scales but it should be computed only for data based on a scale with an absolute zero (e.g., Kelvin, not Celsius or Rankine, not Fahrenheit).Although the average has been commonly used and is suggested by [42,43] and [44] as a normalization means, this is not an appropriate method when the dataset contains a large number of zeros since the average is then close to zero and the NRMSE is no longer meaningful as highlighted in [78].Therefore, it is noteworthy to mention that the normalization process may be based on different normalization means, such as average, range, interquartile range (not affected by outliners) or standard deviation (affected by outliners).Inconsistencies and problematics related to the application of statistical indices are explained in detail in Section 3.1.

Evaluation of the computational cost
For the evaluation of the computational cost, each tool runs the corresponding model on the same local workstation with the following specifications: Processor: Intel Ò Core TM i5-8350U CPU @ 1.7 GHz; 4 physical cores; 8 logical processors; RAM: 16.0 GB; OS: Windows 10, 64 bit.
During the simulation, no other applications were running in the background except the operating system.Each computation was repeated 10 times and the median value of the CPU times was used for the comparison.Since the outputs are treated differently in each tool, the computational cost was evaluated including and excluding the processing of the output to analyse the computational cost of the model itself and to be able to address the additional time needed for the preparation of the output.28 days of pre-simulation time is considered for all the tools.Table 2.10 reports the simulation set-up used in each tool.Figure 3.7 shows the comparison of the runtimes.

Results and discussion
In this section, inconsistencies and problems related to the application of statistical indices are explained and the monthly and hourly results of all the tools are reported together with their deviations.In addition, the results of the computational cost are presented.Furthermore, the influence of the parametrization process on the results of IDA and PHPP is reported.

Challenges related to the usage of statistical indices
The statistical indices introduced in Section 2.5 were analysed to explain their limits and their application within this work.The literature review highlighted that different approaches are applied for the determination of deviations in the building calibration and validation field and they differ mainly by the statistical indices that are used, and the time frame and reference variables on which they are calculated.Another aspect that is not often mentioned is the normalization of the statistical indices, which can be problematic especially when the normalization approaches zero.To show these aspects, the statistical indices were calculated using the deviations of the convective temperatures and heating power between the

Non-normalized indices
Normalized Indices  3.1b applied to the convective temperature, and Figure 3.2b as applied to the heating power.
The normalization of the RMSE was calculated using the average (av), average of the absolute value higher than zero (av > 0), the range, the interquartile range (iqr) and the standard deviation (std).In addition, the statistical indices were calculated across different time periods using an hourly resolution.These time periods can vary from a few days (e.g., periods in which measurements are available) [16], to a month [46] to a year [5].Here the results are presented considering monthly and annual time frames.
Figure B.4 shows the distribution of the absolute deviation of the internal convective temperature for all the tools considering hourly data for a one-year simulation.It can be noticed that the deviation range for SIM_IBK is ± 0.5 K and only for 2% of the time which can be considered as a good agreement.Figure 3.5b shows a box plot of the reference convective temperature distribution for each month and considers the data set of the whole year.Figure 3.1a shows the MBE, MAE and RMSE for each month and also considers the data for the whole year.It can be noticed that the MBE is the only indicator giving information about the sign of the deviation and that the MAE and RMSE follow the same trend but the RMSE is always higher than MAE since it is more sensible to the outliners.Figure 3.1b reports the normalized indices and the R 2 , in this case, NMBE, NME and NRMSE_av are ranging between À0.02% and + 0.1%.The same indices calculated using the temperature in Celsius would vary between À0.27% and + 1.3%.Since the internal convective temperature is always higher than zero, NRMSE_av is equal to NRMSE_av > 0. In this case, it is not possible to normalize the RMSE for the iqr or std during the winter and summer months since the temperature is close to the set point most of the time (as highlighted in Figure 3.5) therefore iqr and std are set equal to zero.These normalizations could only be used when the hourly data of the whole year were considered.The NRMSE_range can also be misleading since this index would be higher in months where the temperature is constant (i.e., range close to zero).The R 2 is lower in the months where the temperature is close to the set point (January, June, July, August and December) since in these periods the reference temperature is always near to the average temperature (see Eq. (2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)).
Figure 3.5 shows the distribution of the reference heating power used for the calculation of the SIM_IBK deviation.
Figure 3.2 shows the indices considering the hourly data of heating power on monthly and yearly time frames.In addition, one additional column is added considering hourly data for the whole year only when the heating system is active.The monthly absolute deviation indices (i.e, MAE and RMSE) tend to be lower in April and May where the heating system is working less.The normalization of the statistical indices is also non-trivial for the heating power, in fact in April and October (see Figure 3.2b) where the heating power trends toward zero almost all the normalization means trend to zero as well (std, iqr, and av).This leads to really high relative deviations.The only normalization means that allow a stable evaluation of the deviation with relative indices is the range and av > 0. Another reasonable option might be to exclude transition periods (i.e.April and October) but this would require a subjective definition of the threshold under which the heating demand is too low for the calculation of the deviations.This problem arises also when a building has either a high envelope quality or is placed in a warm climate, i.e. if it has a really low heating demand (e.g. in this study for the climate of Rome).
By considering the whole year as a time frame, the NRMSE_av is 31.1% which would even be outside the thresholds suggested by ASHRAE Guideline 14-2014 [42] even though looking at Figure B.3a it can be noticed that SIM_IBK records deviations in heating power only 1% of the time of the whole year simulations and the deviation range between ± 5 W/m 2 .The problem, in this case, is that the average heating power for the whole year trends towards zero since in many periods the building does not need any heating power (see Figure 3.5a).Calculating the deviation including only the heated periods (annual > 0), leads to a much lower NRMSE_av (18.2%), thanks to a higher mean value (see Figure 3.5a), even though the RMSE is higher since the number of samples is lower (see Figure 3.2a).
This approach would require selecting the periods over which the deviations are calculated leading to a subjective decision process.Another possibility is to consider all the available data and normalize the sum of the squared errors with the average of the reference values including only reference data higher than zero (NRMSE_av > 0).
The NMBE is not influenced by this problem since the sum of the reference values (i.e. in this case reference heating powers) at the denominator does not change considering or not values equal to zero.
After this analysis, it can be stated that: The MBE/NMBE is needed to show the sign of the deviation, but another index has to be used since the MBE suffers from the cancellation effect; The RMSE and MAE provide similar information but the RMSE is more sensitive to the outliners than MAE where each error increases linearly the MAE; The R 2 is not a good index when the analysed variable is mainly constant; The normalization of the statistical indices is a complex step and has to be performed carefully.In the current work this work the NRMSE_av > 0 will be used for the analysis of temperatures, power and energy; Having a larger dataset (hourly data for the whole year) makes the evaluation of the deviations more robust.Therefore when hourly data are considered the indices should be calculated over the whole year; The absolute thresholds suggested by ASHRAE Guideline 14-2014 can lead to unjust conclusions regarding the validity of cases with low energy demand (i.e.high-quality buildings or warm climates).

Results of the tool Cross-Comparisons
In this section, the monthly and hourly results of all the tools is presented that considers all the climates (i.e.Rome, Stuttgart and Stockholm) and the corresponding deviations.

Monthly results
Figure 3.3 shows the monthly heating (HD) and cooling (CD) demands, solar gain (SOL), ventilation summed with the infiltration (INF + VENT) and transmission (TR) losses for all the tools except for the PHPP where only HD and CD are reported, for climates of Stockholm, Stuttgart and Rome.The median value of each component of the thermal balance is presented in Figure 3.3 as a solid black line.The monthly values of the internal gains are excluded from the figure because all the tools consider the same monthly energy.Figure 3.3 shows that an overall good agreement between all the tools is reached, after the parametrization phase (see Sections 2.4 and 3.4).The PHPP was aligned with the other tools in the coldest winter and hottest summer months, but it could not exactly predict the HD and CD during the interim season.In all the locations, SIM_IBK had slightly lower infiltration and ventilation losses during the first part of the year.Considering the climate of Stockholm, DAL and EP had slightly higher solar gains and MOD slightly lower than the median value.Regarding the ventilation and infiltration losses, EP recorded lower losses in winter and slightly higher in summer, while IDA had higher ventilation losses in summer with respect to the median value.The effect of the capacity of the adiabatic structures generated deviations in the transmission losses in spring and autumn especially for the climate of Stockholm.In Stuttgart, EP was slightly above the median value concerning the solar gains, while IDA lower.MOD had slightly higher solar gains during winter and DAL during the summer with respect to the median value.Regarding the climate of Rome, IDA had higher solar gains during the whole year compared to the other tools and DAL only during summer.
All the component of the thermal balance are acting together and the results are the heating and cooling demands.In this specific case study, the control of the ventilation system allows to reduce the overheating problem (or cooling demand) compensating higher solar gains with increased ventilation losses.

Hourly results
Figure 3.4 shows the hourly average convective temperature (# c ), ventilation and infiltration losses ( _ Q InfþVent ), solar gain ( _ Q SOL ), heating ( _ Q HD ) and cooling power ( _ Q CD ) for all the dynamic simulation tools for climate of Stockholm, for four different periods (48 h each) to represent the typical behaviour in all the seasons.From the convective temperature plots, it can be noticed that DAL is responding slower than the other tools in the free-floating periods since it considers only one lumped capacity.The convective temperature in IDA is reacting faster than other tools and in MOD it is rising slightly earlier than other tools during the reported spring period (this is not happening systematically every day).Regarding the ventilation losses it can be noticed that DAL and IDA are starting one hour earlier than other tools in spring, summer and autumn since they consider the daylight-saving time.In addition, EP has a slightly different control of the ventilation heat recovery bypass, which is activated slightly later than other tools in autumn and spring.The results show that solar gains are in good agreement as well as the cooling demand, even though DAL has sometimes higher peaks of the solar gains compared to the other tools due to the different sky model and shading control.holm considering monthly and annual data.In Figure 3.5a there is an additional box plot of the heating power considering the whole year, for those periods when only the heating system is activated (''annual > 0").It can be noticed that the median heating power is equal to zero and the mean is close to zero if the whole year is considered (i.e., ''annual") while excluding the periods in which the heating system is switched off (i.e., ''annual > 0") the median is close to the mean value and they are both higher than zero.Problems might arise from this aspect concerning the normalization of the statistical indices as reported in Section 3.1 in more detail.

Deviations
Figure 3.6 presents the NMBE and NRMSE_av > 0 calculated on an hourly (h) and monthly (m) basis using the energy gain and losses associated with heating, cooling, infiltration, ventilation for the climate of Stockholm (a), Stuttgart (b) and Rome (c).The thresholds 6 for the NMBE and NRMSE are reported in Section 2.5 and are shown in Figure 3.6 using dot-dash lines.It is noteworthy, that PHPP only calculates heating and cooling demand on a monthly basis as standard output.
For the climate of STO and STU (see Figure 3.6a and b) DAL exceeds the threshold of 30% for the hourly NRMSE calculated on the solar gains, though the other indices are clearly within the thresholds.This is due to the combined effect of the different sky model used in DAL and the calibration of the shading threshold which allow reaching the same solar gain from the energy point of view, but with a different hourly distribution with respect to the other tools (see Figure 3.4 and Figure B.1). IDA slightly exceeds the solar gains NMBE_m in STU even though the other components of the thermal balance remain acceptable, while in ROM the higher solar gains also cause high ventilation losses and cooling demand, which leads to the thresholds (i.e., NMBE_m for solar gains and cooling demand and NMBE_m/h, NRMSE_av > 0_m/h for the ventilation losses) being exceeded.In ROM also SIM_IBK slightly exceeds the NMBE_m for the ventilation losses, even though all the other components of the balance are within the thresholds.
The results for monthly heating demand in ROM show that TRN, SIM_BO and IDA are within the thresholds while EP, DAL, SIM_IBK, MOD, DAL and PHPP are not.However, the HD in ROM is so low that it could be disregarded.In most cases, the hourly NRMSE_av > 0 is higher than the monthly value, while the hourly and the monthly NMBE deliver the same information.The results show that it is important to take both the NMBE and the NRMSE into consideration since in some cases the overall annual energy gains and losses might be in agreement, but the deviations considering hourly data are not.For example, the calculation of these indices for the solar gain in DAL can be observed.The hourly NRMSE_av > 0 calculated for the ventilation losses is particularly high for IDA and DAL because these two tools are the only two considering the shift between the daylight saving time and summer time, which leads to a shift in the schedule of the ventilation system (as can be seen in Figure 3 For sake of better readability, R 2 was not reported in the graph but is always above 75%, (the minimum required by ASHRAE Guideline 14-2014), except for the heating demand in ROM for DAL where it is equal to 61% and for the ventilation losses of DAL in Stockholm, Stuttgart and Rome where it is equal to 67%, 64% and 72%, respectively.Concerning the convective temperature, the agreement between the tools is always within the thresholds in all the climates.
Overall, it can be stated that a good agreement between the tools is reached and that all the models are quite reliable after the parametrization process.

Computational cost
In Section 2.6 the boundary conditions in which the computational cost of each tool is calculated are defined.The PHPP is an instantaneous calculation tool therefore the fastest and is not included within this comparison.Figure 3.7 shows the computational cost for each tool without the post-processing of the output (in blue) and including it (in red).The filled diamond represents Box plot of the reference (i.e.median value of all tools for each time step) (a) heating power on monthly basis, for the whole year (i.e., ''annual") and for the whole year considering only periods where the heating system is switched on (i.e.''annual > 0"); (b) convective temperature on monthly basis and for the whole year (i.e., ''annual").The red lines represent the median and the black rhombuses the mean.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)the relative additional time required when the output file needs to be prepared.The number and format of the outputs are different in the different tools.This is important to consider because the number of outputs considered is a user-defined variable that depends on the level of detail the user needs to analyse the results.For these reasons, the computational time is shown with and without output-processing to make a fair comparison of the calculation effort required by the model itself and, at the same time, to give an insight into the time required for the simulation to generate the output file.
The computational time of MOD is comparable with IDA, while EP and DAL can run an annual simulation in only a few seconds.IDA ICE can decrease to half the simulation time by splitting the annual simulation into various time slices, assigning them to different processing cores in the computer, while the other tools (TRNSYS, Simulink, Modelica and EnergyPlus) can profit from the multiple cores when more simulations run simultaneously (e.g., parametric sweep), but not on a single simulation for this model configuration.Among the investigated tools, DAL is the fastest but it must be highlighted that is the only tool based on an hourly time step and can be used only for predefined simple geometries (i.e., a shoebox) for which the view factor-and daylight matrices are pre-calculated.Although EP is one of the fastest tools, it requires a similar amount of time, for the elaboration of the output, than TRN and SIM_BO.The MOD and IDA simulations are faster than the other tools for the preparation of the output file.
Another aspect that is not depicted in Figure 3.7 is the time required for setting up the model.TRNSYS, IDA ICE, user interfaces for EnergyPlus, both Simulink libraries and PHPP can import the geometry information from a 3D drawing (i.e., gbXML, idf, dxf, IFC, etc. ..), easing the definition of the building model.For Modelica/Dymola such features are currently being developed within IBPSA project 1 [80].Nevertheless, the requirements of the 3D drawing are different for the different tools therefore the interoperability is not straightforward.Though, this problem is supposed to be addressed by the improvements in the field of BIM to BEM in the near future.

Results pre and post-parametrization for DAL and PHPP
This section describes the effects of the parametrization process (see Section 2.4) on the results of PHPP and DAL, reporting the results of the non-parametrized simulation (i.e., DAL_online and PHPP_nonpar) and of the parametrized simulation (i.e.DAL_par and PHPP_par), which correspond to the final results of DAL and PHPP reported in Sections 3.2.1,3.2.2 and 3.2.3.
With pre-design tools, it is not straightforward to model aspects such as hourly internal gain profiles, dynamic shading and ventilation bypass control as required by the description of this case study.The average value of the power due to the internal gain (see Figure 2.6) can be easily determined by averaging the hourly profile over the day for DAL (19.7 W/m 2 ) and over the week for the PHPP (6.5 W/m 2 ).The standard effective ventilation rate needs to also be recalculated for both PHPP and DAL online since both tools are based on a constant rate while, according to the description of the building the ventilation system, is switched on only during the working time.The additional ventilation rate due to the dynamic control of the bypass is considered in DAL online while in PHPP it is only possible to set this option for the summertime since the control is based on monthly ambient temperature instead of hourly as in DAL.The shading can be controlled as required by this case study in DAL online but the solar irradiation on the south facade used by DAL is different from the other tools since an isotropic sky model is applied instead of the Perez model, therefore the solar gains are also different.In PHPP only a summer and winter shading value can be given as input and in this case study, it would be difficult to guess this constant without knowing the results of the other dynamic simulation tools.
In PHPP_nonpar the standard assumptions are used during winter (25% reduction) and the reduction factor described in the report (70% reduction) is used for the summertime as temporary sun protection.In DAL online and PHPP_nonpar, the HTC of wall and window described in the report are used.DAL_par compared to DAL_online implements an improved ventilation model where only during the occupied time the ventilation system works, moreover the shading control threshold, the HTC of the façade and the building capacity are parametrised against the results of the other dynamic simulation tools.
PHPP_par implements a parametrized shading coefficient for summer and winter, ventilation rate and HTC (see Table 2.8).Table 3.1 reports the annual heating demand (HD) and cooling demand (CD) of PHPP and DAL pre and post parametrization, showing the relative deviation against the reference HD and CD (median of the results of all the tools).Figure 3.8 shows the monthly heating and cooling demand, infiltration plus ventilation and transmission losses, solar gains and average convective temperature simulated with DAL and PHPP pre and post parametrization and the reference value derived as a median of the results of all the tools for the climate of Stockholm.PHPP delivers only monthly heating and cooling demand, all the other components of the energy balance can be extrapolated by the user, but especially in spring and autumn, this process is non-trivial.Therefore,  3.1).

Lessons learned
Numerous iterations were necessary to reach a good agreement among the different tools even though the case study is well described in a comprehensive report and the reference building is geometrically simple.The challenges encountered in this process are only partly avoidable since user mistakes are difficult to foresee.Part of the challenge could be prevented by having a more detailed description of the building, avoiding misinterpretation and the entire parametrization process.An important issue that must also be highlighted is that since the tools implement different models, the outputs and the required inputs are defined differently.Defining equivalent input parameters for each model is a non-trivial task considering that different models describe physical phenomena using different levels of abstraction and at different levels of scale.A lesson that the authors derived from the parametrization process is that it is recommended to define input parameters starting at the most detailed scale and the lowest level of abstraction amongst the tools that are going to be compared.Using the window system as an example, the process started with a reference office description where the window was described using properties at the level of the glazing system (TRN, SIM_BO) and detailed properties at the level of the individual window pane (EP, MOD) had to be derived through an exhaustive search process.The process of defining equivalent input parameters could have proceeded faster, however, if properties were originally defined at the window pane level and then used to derive equivalent properties at a higher level of abstraction based on subsystem simulation.
The window example also illustrates the problem of aligning simulation outputs.In tools with a simplified window, only the total solar gain can be given as output, while in other tools solar gains are reported using only directly transmitted solar radiation or the solar radiation absorbed at interior surfaces.Here, clearly defining the boundary conditions of different energy contributions was an essential part of aligning the simulation results of the different tools.The following paragraphs will give an overview of the user-related input errors and modelling decisions that were identified throughout the calibration process as sources of discrepancies between the different tools.Additionally, the magnitude of these discrepancies at different stages of calibration will be presented.
User mistakes experienced within this case study, were as follows: Flipped order of the layers of surface constructions; Wrong interpretation of the HTC of the window (HTC-glass instead of HTC-window).HTC-window is the results of the area-weighted HTC of the frame and the glass; Wrongly selected weather file where different data for the same location is used; Wrong starting day of the internal gain profile; and Wrong wall thermal capacity.
Wrong interpretations of the office description that could have been prevented by a more detailed description were made regarding the: Control of the mechanical ventilation; Control of the shading; Ventilation air volume flow; Heat recovery from fans of the ventilation system; and Control of the pre-heater used to avoid ice formation in the ventilation heat recovery system.Moreover, information that was initially missing in the report were implemented differently in the different tools, i.e.: Air density (as a function of temperature or constant); Absorption and emission coefficient of opaque structures; and The convective exchange coefficients (as a constant or function of the temperature difference); Comparing only the modelled heating and cooling demand, was not enough to find the reasons for the deviations between the tools.Often, higher gains might be balanced by higher losses hardly affecting the heating and cooling demand and therefore the zone heat balance was investigated.Additionally, isolating particular heat transfer phenomena assisted in identifying the cause of deviations.For instance, simulations were executed: Without solar radiation; Without windows; With a permanently activated, or no shading device; Using constant convective heat transfer coefficients, identical in all models, rather than the detailed algorithms available in some tools; Without internal gains; With different modelling approaches of the adiabatic structures; Without ventilation system.All these cases were not executed with all the tools, but only where deviations needed to be identified with further analysis.Along with the different tests, an increasing number of outputs were analysed and compared (e.g., Internal surface temperatures, all the details of the balance of windows and walls, all the components of the weather data, distribution of the internal gains and solar gains, ventilation and infiltration losses, etc. ..).
The window model has a key role, in this case study, since it represents the main source of transmission losses.Therefore, the parametrization of the input for both simplified and detailed window models was a time-consuming process.Within this work, using the given constant HTC for the window in Stockholm deliver an HD 37% lower than using the detailed window model.Table 4.1 reports the annual heating (HD) and cooling (CD) demand of each tool for the climates of Rome (ROM), Stuttgart (STU) and Stockholm (STO) in the initial iteration (V1), an intermediate step (V2) and the final results presented in this paper (V3).For each case, the maximum and minimum annual NMBE are reported and on the right-hand side, the dispersion of the annual HD and CD is represented by box plots for each iteration.The first iteration is represented by V1, here the dispersion of the results is important since many user mistakes and wrong interpretation of the office description are present.Within the V2 the situation is improved, but still not acceptable since the spread of the results is still high.
With V3 a good agreement is reached thanks to the recognition of the user mistakes and the parametrization of the window model.From V1 to V2 the window properties were parameterized in EP, the set point of the anti-freezing was corrected in TRN, the shading model, the volume of the TZ, the starting day of the occupancy profile, the adiabatic model of the opaque structure and the weather file were corrected in SIM_IBK, and the HTC of the window in DAL.From V2 to V3 an improved parameterization of the window properties was carried out for MOD and EP, the convective coefficient calculation was modified to use the same equations in EP, TRN and MOD.In SIM_IBK was introduced the window model from the library of SIM_BO and the order of the construction layers of the adiabatic ceiling and floor was corrected in MOD and TRN.

Conclusions
In this work, the modelling approaches of well-known dynamic simulation tools, EnergyPlus, TRNSYS, IDA ICE, Modelica/Dymola the new Matlab/Simulink libraries, CarnotUIBK and ALMABuild as well as the predesign tools DALEC and PHPP were described and the tools compared against each other using a typical office cell located in Rome, Stuttgart and Stockholm as a reference building.To compare the results of the different tools, commonly used statistical indices and normalization means were analysed and finally the Normalized Mean Bias Error and Normalized Root Mean Square Error were used to assess the degree of agreement of the results with the median value used as a reference.The thresholds suggested by ASHRAE Guideline 14-2014 were used as a basis for this evaluation.The normalization process of the statistical indices was non-trivial and the encountered challenges were presented and thoroughly highlighted within this paper.
After many iterations, it can be stated that overall a good agreement between all the dynamic tools was reached.The annual results from the PHPP program was also aligned with the annual results of all the other tools.At the beginning of this process, the relative deviation of the heating and cooling demands predicted by the different tools was from + 61% to À34%, which was reduced to + 7% to À5% after many simulation rounds (excluding the heating demand in Rome which is almost negligible).The deviations that were experienced were mainly due to difficulties in defining equivalent input parameters for the models based on different approaches; user mistakes; and misinterpretation of the building description.User mistakes were clearly difficult to avoid, while different interpretations of the building description were mainly caused by an insufficient description.At the same time a tedious parametrization phase was required for those tools which were either more simplified (e.g., PHPP and DALEC) or more complex (e.g.Modelica and EnergyPlus) compared to the building description that was written using TRNSYS as a reference.The parametrization of inputs such as the ventilation rate and constant shading allowed the comparison to reach a good agreement even for simplified tools like PHPP, where dynamic control logics (i.e., shading and ventilation control) required for this case study, cannot be modelled.The results of DALEC and PHPP pre and postparametrization for the climate of Stockholm were reported and the deviation of PHPP annual heating demand compared to the reference median value was reduced from À64% to À2% and from À11% to 1% for the cooling demand.The different tools involved in this study allow analysis with different degrees of detail and at the same time, they have very different computational costs.An overview of the modelling approaches available within the different software packages was given and the computational time required by each model used for this case study was reported, to support users in choosing a simulation tool that fits their purpose.
The computational cost was quantified by running the simulations using the different tools on the same computer.DALEC and PHPP were both almost instantaneous.However, these were also simplified compared to the other simulation software packages.EnergyPlus ran for about 5 sec and was faster compared to the other tools.TRNSYS and ALMABuild were in the same range around 20 sec, whereas Modelica/Dymola and IDA ICE were slightly slower with 38 and 33 sec respectively.CarnotUIBK was the slowest with 67 sec.The histogram reporting the distribution of the deviations can give precious information regarding the sign of the deviation and how often it occurs.

CRediT authorship contribution statement
Figure B.3 shows the distribution of the absolute deviations of the heating power for all the tools considering the climate of Stockholm including the whole year (a) and only periods in which the heating system is active (b).It can be noticed that the distribution of the deviations for the different tools is the same in graphs a and b but the frequency of the deviation is drastically higher when only periods in which the heating system is working are considered.

C. Summary of reference office cell description and boundary conditions
In Table C1 a short description of the office building inputs and of the applied boundary conditions is provided.

j o u r
n a l h o m e p a g e : w w w .e l s e v i e r .c o m / l o c a t e / e n b Figure 5.4.1-7)

Figure 2 . 7 .
Figure 2.7.PHPP performs only monthly balances and the shading coefficient can be given only as a constant that might be different for the summer and winter balance (see Table2.8).Therefore, the graph reported in Figure2.7 would be completely black for the PHPP and the value represented by the black colour would be the constant shading coefficient given as input to the PHPP.The same applies to the control of the bypass of the ventilation heat recovery.Concerning the solver settings, the calculation time step can be defined as constant or can be variable in Simulink, MOD and IDA while it is fixed with the time length defined by the user in TRN .3, lines: Window thermal model and Window optical model, column SIM_BO) has been used also in the SIM_IBK model and the convective and radiative exchange coefficients of SIM_IBK

Figure 2 . 4 .
Figure 2.4.Sketch of the window models addressing the solar distribution and the thermal resistance between the panes.In (a) a thermal balance over each pane is performed and the solar gains are distributed over each pane and in (b) a constant thermal resistance is used and the total solar gains are calculated using the SHGC.

Figure 2 . 6 .
Figure 2.6.Total internal gain profiles, including occupancy, lighting and appliances, used in the PHPP (as a constant average value), in DAL (as a daily profile) and all the other dynamic simulation tools (as hourly profiles).

Figure 2 . 7 .Figure 3 . 1 .
Figure 2.7.Hourly shading control simulated with SIM_BO for the climate of Stockholm.The blue and red lines represent sunrise and sunset time, respectively.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Figure 3 . 2 .
Figure 3.2.Non-normalized (a) and normalized (b) statistical indices applied for the calculation of the deviation between the simulated heating power with SIM_IBK and the reference heating power calculated as the median value of all tools for each time step, for the climate of Stockholm.

Figure 3 . 3 .
Figure 3.3.Monthly heating (HD) and cooling demand (CD), ventilation and infiltration losses (VV), transmission losses (TR), solar gains (SOL) and median values for the climates of Stockholm (a), Stuttgart (b) and Rome (c) for all the tools.
Figures for the climates of Stuttgart (STU) and Rome (ROM) are reported in Appendix B (Figures B.1 and B.2).

Figure 3 .Figure 3 . 4 .
Figure 3.4.Hourly results (i.e., convective temperature, ventilation plus infiltration losses, solar gain, heating and cooling power) for all the dynamic simulation tools for the climate of Stockholm for winter, spring, summer and autumn periods (x-axis is hour of the year).

Figure 3 .
Figure 3.5.Box plot of the reference (i.e.median value of all tools for each time step) (a) heating power on monthly basis, for the whole year (i.e., ''annual") and for the whole year considering only periods where the heating system is switched on (i.e.''annual > 0"); (b) convective temperature on monthly basis and for the whole year (i.e., ''annual").The red lines represent the median and the black rhombuses the mean.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Figure 3 . 7 .
Figure 3.7.Computational time of each tool without the processing phase of the output (in blue) and the additional time required for the output preparation (in red).The dots represent the relative additional time when the outputs are prepared.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Figure B2 .
Figure B2.Hourly results (i.e.convective Temperature, ventilation plus infiltration losses, solar gain, heating and cooling power) for all the dynamic simulation tools considering the climate of Rome and 4 representing periods for winter, spring, summer and autumn.
Figure B.4 shows the distribution of the absolute deviations of the convective temperature for all the tools considering the climate of Stockholm.From Figures B.3 and B.4 it can be easily noticed that IDA and EP reported a non-symmetrical distribution of the deviations.In particular, IDA reported a lower convective temperature (À0.5 K) more often than other tools.Starting from this information and analysing the hourly plot for the whole year, it was possible to see that the convective temperature of IDA in summer during the night is decreasing slightly faster than the other tools.For EP it can be noticed that the distribution of the convective temperature deviation (see Figure B.4) is shifted towards the positive side and the distribution of the deviation of the heating power is shifted towards the negative side (see Figure B.3). Analysing the hourly plot for the whole year, it was possible to notice that the ventilation control of EP on some days during winter caused lower ventilation losses with respect to the other tools.The peaks of the heating power of EP, visible in Figure 3.4 (hour 775), caused the frequency peak correspondent to 10 W/m2 in Figure B.3.The high dispersion in both convective temperature and heating power deviations for DAL is due to the simplified capacity model used within this tool (see Sections 2.3, 2.4, and 3.2.2).

Figure B3 .
Figure B3.Absolute deviation of the hourly heating power for all the tools for one-year simulation in the climate of Stockholm, (a) considering the whole year, (b) considering only periods in which the heating system is working.The deviation is calculated as the hourly heating power of the tool minus the hourly reference power (median of all the tools).

Figure B4 .
Figure B4.Absolute deviation of the hourly convective temperature for all the tools for one-year simulation in the climate of Stockholm.The deviation is calculated as the hourly convective temperature of the tool minus the hourly reference convective temperature (median of all the tools).

Table 2 . 1
Main boundary conditions: yearly average ambient temperature (# [82] applied in the present comparison.A preliminary study including the comparison of SIM_BO, EP and TRN modelling the office cell equipped with photovoltaic panels and air to air heat pump is presented in[82]Section 2.1.5.

Table 2
.3 and Table 2.4) is considered as a constant (Table 2.4: C1/C2) in PHPP and DAL.Contrariwise EP, TRN, IDA and MOD offer both constant

Table 2 . 2
Main properties of the south-oriented façade.
4Not included in the current study.

Table 2 . 3
[72]ary of the features of the mathematical models employed in each tool within this case study (black points) or are available (grey points).EP, IDA and MOD require as inputs the solar reflectance for each side of each pane and the solar transmittance of each pane, while TRN and SIM_BO require only the solar absorption on each pane and the overall solar transmittance calculated using an external subsystem simulation (LBNL-Window[72]).
1 (see Table 2.3, lines: Convective heat transfer and Radiative exchange, column SIM_IBK) have been used in SIM_BO.

Table 2 .
6HTC glass and façade reported in the report, calculated with SIM_BO and used in DAL, PHPP and IDA.

Table 2 . 4
Internal and external convective (subscripts ci and ce) and radiative (subscripts ri and re) exchange equations where: K is a constant heat exchange coefficient [W/(m 2 K)], v is the wind speed [m/s], T and # are temperatures in K and °C respectively, F ij is a view factor between the surfaces i and j, H is the radiosity [W/m 2 ], e is the emissivity and r the Stefan-Boltzmann constant 5.670 373 (21) Â 10-8 [W/(m 2 K 4 )].The subscripts C and R mean convective and radiative, si and se are the internal and external surfaces, amb is ambient and gnd is ground.

Table 2 .
5Parameter set that needed to be adapted for each tool.

Table 2 . 7
Effective volume flow of the mechanical ventilation system and volume flow when the bypass is activated described in the report and used in DAL and the average volume flow used in PHPP for the winter and summer balance.

Table 2 . 8
Winter and summer ratio between solar gains and incident solar radiation derived from Simulink and applied in PHPP.

Table 2 .
10Simulation settings for the different tools.EP implements different algorithm for the numerical solution of the thermal zone and of the HVAC system.The time step is fixed for zone heat balance calculation and variable for HVAC system simulation. 1

Table 3 . 1
Annual Heating and Cooling demand (HD, CD) calculated with DALEC and PHPP pre and post parametrization compared against the reference heating and cooling demand (median of all the tools), for the climate of Stockholm.