Data-driven building archetypes for urban building energy modelling

This paper presents an approach for using rich datasets to develop different building archetypes depending on the urban energy challenges addressed. Two cases (building retrofitting and electric heating) were analysed using the same city, Stockholm (Sweden), and the same input data, energy performance certificates and heat energy use metering data. The distinctive character of these problems resulted in different modelling workflows and archetypes being developed. The building retrofitting case followed a hybrid approach, integrating statistical and physical perspectives, estimating energy savings for 5532 buildings from seven retrofitting packages. The electric heating case provided an explicitly statistical data-driven view of the problem, estimating the potential for improving the power capacity of the local electric grid at 147 MW of peak electric power. The conclusion was that the growing availability of linked building energy data requires a shift in the urban building energy modelling (UBEM) paradigm from single-logic models to on-request multiple-purpose data intelligence services. © 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Energy conservation in the existing building stock is the cornerstone for reducing primary energy consumption and greenhouse gas (GHG) emissions in cities [1]. In order to meet ambitious energy efficiency and climate mitigation targets, cities need to understand current energy demand and future effects of various retrofitting decisions. At the same time, the electrification of heating is recognised as one of the main technological changes driving reduction of GHG emissions within the heating sector [2]. However, the increasing grid loads associated with space heating impose higher requirements on feeder and transformer capacity, affecting the power quality, capacity, cost and regulation requirements in the distribution networks [3].
Various energy models are used for planning and decision support for strategies on building stock retrofitting and securing electric power supply. However, the conventional energy models are either limited in detail or too exhaustive to be applied on a large scale [4]. At the same time, emerging urban building energy models (UBEMs) use growing volumes of urban energy data and significantly reduce the amount of effort needed from human modellers [5]. UBEMs provide automated generation of building energy models through abstraction of building stock by different 'building archetypes', i.e. sample or virtual buildings that characterise subsets of buildings of the same kind [6]. However, replacement of the whole archetype building subset by a single building induces a risk of oversimplification, resulting in poor quality of the modelling output. Availability of rich datasets on factual heat energy use opens the way for approaches to generating various building archetype subsets based on physical and statistical models of the city building stock [7].
The aim of this study was to demonstrate how using rich datasets allows different building archetypes to be developed depending on the urban energy challenges addressed. It builds on two cases from the same city, Stockholm (Sweden), using the same input data, energy performance certificates and heat energy use metering data. The purpose of Case 1 was to identify the potential for energy saving through large-scale building energy retrofitting. The purpose of Case 2 was to explore the current usage of electric heating and identify its potential for reducing the electricity power demand. A UBEM for Stockholm with heating as the main scope was developed in both cases, but the distinctive character of the problems addressed resulted in different modelling workflows and archetypes being developed. In Case 1, future effects of retrofitting were estimated, while Case 2 provided statistical aggregation for building electricity use for heating that exists now. Hence, Case 1 followed a hybrid approach, combining statistical (energy signatures) and physical (building energy simulations) modelling, while Case 2 corresponded to a statistical view of the problem built explicitly on currently available data. Regarding the archetypes developed, Case 1 did not have a requirement to cover the whole city. Instead, the focus was narrowed down to three archetypes corresponding to the most typical classes of buildings, ensuring feasibility of the retrofitting measures analysed. In Case 2, all buildings using electric heating were the focus and the aim of archetyping was to provide several levels of detail, allowing for comprehensive exploration of the power capacity issue.
The study examined whether integration of various building energy datasets, such as energy performance certificates and metered heat energy use, allows for greater flexibility in the choice of UBEM, resulting in better alignment of the archetypes constructed to the problem addressed and consequently better decisions.

Urban building energy modelling
Rapid development of computing and sensor technologies has created a 'big data' challenge to running building energy simulations in a conventional way, which calls for new approaches for handling and utilising the data [8]. A comprehensive overview of the types of existing building energy data and the methods for their collection is provided by Mantha et al. [9]. UBEM has emerged in recent years as an efficient hybrid of top-down statistical and bottom-up engineering approaches [4,10]. It is expected to become a main planning tool for energy utilities, municipalities, urban planners and other professionals [5].
Two main UBEM approaches can be identified from the literature: physical modelling [6] and data-driven modelling [11]. Ref. [12] compared these two approaches by analysing the performance of a physical GIS-based UBEM with detailed dynamic building energy simulation in IDA ICE software and a data-driven UBEM based on nonlinear energy signatures (ES). That study concluded that both approaches have advantages and disadvantages. Therefore, the choice of UBEM approach should be guided by the requirements for the modelling outcomes and by the data and resources available in each particular case.
Creation of city-wide models requires the synthesis of many models with similar characteristics, which is not possible to do manually. Therefore, UBEMs utilise building archetypes, an approach that provides a compromise between accuracy and speed of simulation [13]. It has been extensively used in the context of national and regional bottom-up building stock models to analyse the current state of building stock and the aggregated impact of energy efficiency policies or new technologies [14–16]. Archetyping is usually performed in two stages: segmentation (also referred to as classification) and characterisation (also referred to as parameterisation) [5]. Besides the segmentation and characterisation performed for archetyping, data preparation, modelling and aggregation (quantification) stages are usually present in UBEMs.
Segmentation is a process of introducing taxonomy to the population of buildings analysed, classifying them into one or several layers of categories. This allows archetypes to be defined either as classes (subsets) of similar buildings, or as typical (sample or virtual) representatives of these classes (archetype buildings). The conventional approach to segmentation is to split the building stock according to schemes defined by the modeller in a manual (deterministic) or semi-automatic (statistical) way, usually by socio-economic (type of building usage, income level) [17,18], spatial (climate zone, location) [19], structural (e.g. age, floor area, envelope form, number of floors) [20], energy installation (e.g. heating source, ventilation system, status of refurbishment) [21] or performance (e.g. energy use intensity, total energy, peak power) [22,23] features. However, the development of machine learning techniques (both supervised and unsupervised) allows for more automated statistical segmentation, where the modeller's main input is to define the building (energy) similarity metrics to be used, and not the feature splits themselves. For instance, application of clustering techniques for archetyping has been demonstrated [24,25].
Characterisation is a process of developing a representation for each archetype, defining values for the relevant parameters in a deterministic (single value) or probabilistic (distribution) way (see Loga et al. [26] and Burke et al. (2017)). The scope of parameter characterisation depends on the data available, the problem addressed and the type of building energy performance model used. The values for characterisation are obtained from building data, literature and building surveys [27]. Absent data or data of insufficient granularity can lead to oversimplified and biased archetype characterisation, in which case various calibration methods should be applied [28,29].
Archetyping approaches have received much attention recently, mainly due to their use as a core of UBEMs. However, their purpose and functions can be seen as wider than a bare simplification of the building stock for the purpose of further modelling:
a. Aggregation: provide an overview of the urban building stock analysed [6,23].
b. Explanation: highlight the triggers for the energy performance of the urban building stock through interpretative taxonomies [21].
c. Estimation: estimate energy performance of the urban building stock with the quality of the estimate being within the accepted limits [22].
d. Simplification: decrease computational complexity through removing data redundancy [22].
e. Framing: address only those segments of the building stock that are relevant for the problem addressed [30].
f. Compression: maximise the amount of preserved information about the building stock [22].
g. Transferability: apply archetypes from other cases if input data are lacking [26].
h. Benchmarking: compare building energy performance and related strategies for different cases [31].
Case studies were performed on two archetyping processes that both had Stockholm as the target city, with the same input data used. The next section provides the context and describes the focus of the cases analysed.

Stockholm case studies
The Stockholm metropolitan area is one of the fastest growing urban areas in Europe [32]. The City of Stockholm aims to become a fossil-free city by 2040 [33], which is an even higher ambition than the overall Swedish national climate target of becoming a fossil-free state by 2050 [34]. Stockholm has an extensive district heating (DH) system that covers around 85% of the total heat energy demand in the city. The system is one of the most advanced multi-energy systems in the world and largely corresponds to "fourth-generation DH" [35]. The largest actor in the system is AB Stockholm Exergi (co-owned by Fortum Group), producing around 8 TWh annually [36] and operating 15 068 heat supply metering points. The electrification of heating highlighted as a trend in Ref. [2] is also evident in Stockholm, greatly increasing the role of heating in power grid capacity.
Both cases analysed in this paper were limited to heat provision in the Stockholm building stock. Case 1 explored the potential for reducing heat energy demand through building retrofitting. Case 2 explored electricity-based heating and its potential to increase the capacity of the local power grid.

Case 1. Building retrofitting
Heating and cooling of the building stock is associated with approximately 40% of total GHG emissions in Stockholm. Hence a fundamental target for the City of Stockholm is to decrease building energy use. Stricter requirements on new buildings are part of the strategy. However, since most of the city's building stock already exists, in order to achieve its environmental goals the city intends to introduce energy efficiency measures (varying for different categories of buildings) that will achieve an average decrease of 30% in energy use for heating and cooling [37].
The purpose of Case 1 was to estimate the energy savings potential of large-scale retrofitting, demonstrated for several typical groups of buildings in Stockholm.

Case 2. Electric heating
The projected population increase for Stockholm from 800 000 inhabitants in 2008 to 1 200 000 in 2040 imposes strong requirements on development of urban infrastructure [38]. Large investments are being made in rebuilding the electricity grid in the Stockholm region, including construction of new substations and strengthening of existing power lines. This would improve the total grid capacity and allow all city needs to be met, as depicted in Fig. 1. However, the expansion is not expected to be completed before 2027, resulting in a noticeable gap between currently available grid capacity and projected electricity demand within the period 2021–2026. If this deficit in grid capacity were to come into effect, it would not only stress regular city operations, imposing hard competition for secure power supply among services already in operation, but also jeopardise any further city developments dependent on increased power supply. For instance, it would put at risk the construction of new urban districts, the transition of the transport fleet to electric vehicles, the deployment of more data centres, the emergence of new advanced industries relying on electricity and, consequently, the whole ambition of Stockholm for a fast transition to a fossil-free city.
This potential power capacity deficit could be mitigated by increasing in-city electricity production (e.g. installation of photovoltaics plus storage systems, or more power production at combined heat and power plants (CHPs)), by reducing electricity consumption through energy efficiency measures (e.g. switch to LED lighting, load shift through local energy storage and demand response mechanisms), or by replacing electricity with another energy source (e.g. use of DH for indoor heating and cooling). As Stockholm City has a well-developed DH system, switching electricity-heated buildings to DH is a possible solution for reducing total electrical power demand.
The purpose of Case 2 was to explore the building stock using electricity for heating and estimate the potential for decreasing electricity power demand by changing the source of heat for Stockholm's electricity-heated building stock to another energy carrier.

Methodology
The general research design followed the case study method [40]. Two cases were studied: Case 1 addressed the potential for improving building energy efficiency through large-scale building retrofitting, while Case 2 addressed the potential for freeing up power supply capacity through reducing electric heating. While both cases share most of their context (the same city (Stockholm), the same general domain (urban energy) and the same input data), the problems addressed were quite different and can have implications for distinct areas of urban planning and decision making. This resulted in a marked difference in the branches of the urban building modelling workflow for the two cases (Fig. 2).
In this study, parts of the UBEM workflow developed in a previous study for analysing strategies for large-scale retrofitting [41] were used. Case 1 followed a hybrid approach, integrating both statistical (model) and physical (simulation) perspectives on the building stock analysed, while Case 2 provided an explicitly statistical view on the problem, and hence the insights obtained on the buildings analysed were purely data-driven. Thus the input data (Section 3.1) and statistical modelling (Section 3.2) were the same, but segmentation (Section 3.4), characterisation (Section 3.5) and aggregation (Section 3.7) were conducted differently in each case. Furthermore, additional data transformation (Section 3.3) was performed only in Case 2, and building energy simulations (Section 3.6) only in Case 1.
All calculations were performed in the R language [42] in the RStudio IDE [43], using a number of additional packages: gridExtra, Metrics, tidyverse [44], UpSetR [45] and VennDiagram. Building energy simulations were conducted in the EnergyPlus [46] energy simulation engine through the interactive interface DesignBuilder [47].

Input data
Two main data sources served as input data for both cases: energy performance certificate (EPC) data and measured data on heat energy use from the DH metering points (hereafter FDH data). The EPC and measured datasets were linked through matching cadastral codes, as described in Ref. [48]. Buildings that failed to link or were identified as having significant inconsistency (deviation of more than 20%) between EPC and FDH data were removed before the statistical modelling stage (Section 3.2), as that stage is heavily dependent on FDH data.
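The linking and consistency check described above can be sketched as follows. This is a minimal illustration only, not the workflow of Ref. [48]: the field names (`cadastral_code`, `epc_heat_kwh`, `fdh_heat_kwh`) are hypothetical, and measuring the relative deviation against the FDH value is an assumption, since the text states only the 20% threshold.

```python
def link_and_filter(epc_records, fdh_records, max_deviation=0.20):
    """Link EPC and FDH records by cadastral code and drop inconsistent pairs."""
    fdh_by_code = {r["cadastral_code"]: r for r in fdh_records}
    linked = []
    for epc in epc_records:
        fdh = fdh_by_code.get(epc["cadastral_code"])
        if fdh is None or fdh["fdh_heat_kwh"] <= 0:
            continue  # failed to link, or no usable metered value
        # Assumption: deviation is measured relative to the metered (FDH) value.
        deviation = abs(epc["epc_heat_kwh"] - fdh["fdh_heat_kwh"]) / fdh["fdh_heat_kwh"]
        if deviation <= max_deviation:
            linked.append({**epc, **fdh, "deviation": deviation})
    return linked


# Hypothetical records: A is consistent, B deviates too much, C has no meter.
epc = [
    {"cadastral_code": "A", "epc_heat_kwh": 100_000.0},
    {"cadastral_code": "B", "epc_heat_kwh": 100_000.0},
    {"cadastral_code": "C", "epc_heat_kwh": 50_000.0},
]
fdh = [
    {"cadastral_code": "A", "fdh_heat_kwh": 95_000.0},
    {"cadastral_code": "B", "fdh_heat_kwh": 70_000.0},
]
kept = link_and_filter(epc, fdh)
```

Only building A survives in this toy example, which mirrors the way unlinked and inconsistent buildings were excluded before the statistical modelling stage.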
Besides the EPC and measured data, several additional data sources were used. Climate data on the ambient temperature were used for fitting the statistical models of the building stock and estimating the heat power demand in both cases. Time series for 2012 and the period 1981–2010 with hourly precision were used [49]. Reference data on standardised use and building envelope information were used to characterise additional features required for setting up virtual building energy simulation model archetypes in the retrofitting case. The reference data were obtained from the Sveby project [50] and the "Så byggdes husen" book [51].

Statistical models
Statistical modelling of heat demand was used to estimate building energy performance from measured data in both cases. The heating power demand of each building was modelled with an energy signature $ES = (\theta, c, \beta)$, obtained by fitting the specific heat load $q(t)$ calculated from the FDH data with a quasilinear regression model (Eq. (1)), where $t$ is ambient temperature, $\theta$ is balance point temperature, $c$ is base load (domestic hot water consumption) and $\beta$ is energy performance coefficient. Fig. 3 exemplifies the energy signature for one of the modelled buildings in Stockholm. The energy signature was selected as a simple statistical model that requires only energy use data and supports comparability across large numbers of dwellings [7]. The energy signature method and its applications have been addressed in a number of studies (e.g. Refs. [52–59]). One of those studies [59] concluded that the method is robust, with deviation of less than 2% in the estimated heat loss coefficient compared with measured data from two years. It also concluded that the energy signature method is a useful tool when analysing heat loss and heating energy performance from large datasets. The method has been adopted as a European standard (EN 15603:2008).
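The regression equation itself is not reproduced in this text. A commonly used piecewise-linear energy signature that is consistent with the three parameters defined above (balance point temperature, base load and energy performance coefficient) is sketched below. The functional form and the example parameter values are assumptions for illustration, not the fitted model of this study.

```python
def energy_signature(t, theta, c, beta):
    """Specific heat load at ambient temperature t (assumed piecewise-linear form).

    theta: balance point temperature (degC)
    c:     base load (domestic hot water), in the units of the heat load
    beta:  energy performance coefficient (heat load per degC below theta)
    """
    # Above the balance point only the base load remains.
    return c + beta * max(theta - t, 0.0)


# Hypothetical parameters: balance point 15 degC, base load 5, slope 1.5 per degC.
mild = energy_signature(20.0, theta=15.0, c=5.0, beta=1.5)  # above balance point
cold = energy_signature(-5.0, theta=15.0, c=5.0, beta=1.5)  # 20 degC below it
```

Under this assumed form, the heat load is flat at the base load in mild weather and grows linearly as the ambient temperature falls below the balance point.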
Energy signatures were validated with indices such as MAPE (Mean Absolute Percentage Error), NRMSE (Normalized Root Mean Square Error) and $R^2$ (coefficient of determination), as recommended by Ruiz and Bandera [60]. However, since the measured data contained a significant number of meter readings with zero values, for which MAPE is undefined, MAPE was replaced by MASE (Mean Absolute Scaled Error), as proposed by Hyndman and Koehler [61].
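The three indices can be sketched in a few lines. This is an illustrative implementation under stated assumptions: NRMSE is normalised here by the mean of the observations (other conventions exist), and MASE is scaled by the one-step naive forecast error of the same series. The study itself used R packages, and the data below are hypothetical.

```python
import math


def nrmse(observed, predicted):
    """RMSE normalised by the mean of the observations (one common convention)."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return rmse / (sum(observed) / n)


def r_squared(observed, predicted):
    """Coefficient of determination."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mase(observed, predicted):
    """MAE scaled by the MAE of the one-step naive forecast on the same series."""
    n = len(observed)
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    naive_mae = sum(abs(observed[i] - observed[i - 1]) for i in range(1, n)) / (n - 1)
    return mae / naive_mae


# Hypothetical series containing a zero reading, for which MAPE would be undefined.
observed = [0.0, 2.0, 4.0, 6.0]
predicted = [1.0, 2.0, 4.0, 5.0]
scores = (mase(observed, predicted), nrmse(observed, predicted), r_squared(observed, predicted))
```

Because MASE divides by the naive forecast error rather than by each observation, it stays finite when individual readings are zero.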

Data transformation
To perform the analysis of electricity-based heat energy demand in Case 2 (Section 4.3), the values of supplied energy reported in EPC data were converted into used (heat) energy following the assumptions about seasonal performance factors (SPF) for various heat sources used by the energy auditors during EPC data collection (Table 1): $Q_{source_j} = \eta_{source_j} \cdot E_{source_j}$, where $Q_{source_j}$ is the amount of heat, $\eta_{source_j}$ is the seasonal performance factor (SPF) and $E_{source_j}$ is the amount of purchased energy.
In the case of heat pump sources, the SPF assumptions were distinguished by building type. High-temperature systems ($\eta_{GSHP} = 3.1$; $\eta_{ASHPW} = 2.6$) were assumed for all buildings except single-family dwellings, which were calculated with the assumption of a 50/50 mix of high-temperature and low-temperature systems ($\eta_{GSHP} = (3.1 + 3.9)/2 = 3.5$; $\eta_{ASHPW} = (2.6 + 3.1)/2 = 2.85$). Finally, for EAHPs a new type ($\eta_{EAHP} = 2.6$) was selected, as this type of installation has greatly expanded recently and is expected to further dominate the market.
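The conversion from purchased energy to used heat with these SPF assumptions can be sketched as below. Only the heat pump values quoted in the text are included; the remaining heat sources of Table 1 are omitted, and the function names and the example energy value are hypothetical.

```python
# SPF assumptions quoted in the text (other sources from Table 1 are omitted).
SPF_HIGH_TEMP = {"GSHP": 3.1, "ASHPW": 2.6}
SPF_LOW_TEMP = {"GSHP": 3.9, "ASHPW": 3.1}
SPF_EAHP = 2.6


def seasonal_performance_factor(source, single_family):
    """Return the assumed SPF for a heat pump source and building type."""
    if source == "EAHP":
        return SPF_EAHP
    if single_family:
        # 50/50 mix of high- and low-temperature systems
        return (SPF_HIGH_TEMP[source] + SPF_LOW_TEMP[source]) / 2
    return SPF_HIGH_TEMP[source]


def heat_from_purchased(energy_kwh, source, single_family=False):
    """Used heat from purchased energy: Q = eta * E."""
    return seasonal_performance_factor(source, single_family) * energy_kwh


# Hypothetical single-family dwelling buying 10 000 kWh for a ground-source heat pump.
heat = heat_from_purchased(10_000.0, "GSHP", single_family=True)
```

The same purchased energy therefore maps to different amounts of heat depending on building type, which is why the distinction matters for the later aggregation.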
In addition, Boolean values $QB_{source_j}$ indicating whether or not a particular heat source is used were calculated for each building. This Boolean representation of the heat sources used was required to perform segmentation with heat sources as one of the splitting features.
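A minimal sketch of such a Boolean representation is given below. The rule that a source counts as used when its reported heat is greater than zero is an assumption, as the defining equation is not reproduced in this text; the building data are hypothetical.

```python
def heat_source_flags(heat_by_source):
    """Map each heat source to 1 if the building uses it (assumed: Q > 0), else 0."""
    return {source: int(q > 0) for source, q in heat_by_source.items()}


def mix_label(flags):
    """Short notation for a heat mix, listing only the sources used."""
    return "+".join(source for source, used in flags.items() if used)


# Hypothetical building with reported heat per source (kWh/year).
building = {"DH": 120_000.0, "FBH": 0.0, "EH": 15_000.0}
flags = heat_source_flags(building)
label = mix_label(flags)
```

The flags give one binary feature per heat source, which can then be combined into a single label per building for rule-based segmentation.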

Segmentation
All segmentation schemes applied followed the logic of rule-based categorisation, where each building belongs to exactly one of a set of classes (archetypes): $\forall\, building\ \exists!\, k : building \in A_k$, where $A_k$ is an arbitrary archetype and $A \cup \bar{A} = U$ is the whole population of buildings (all buildings in a district, city etc.), with $A$ denoting the buildings covered by the archetypes of interest and $\bar{A}$ the residual archetype.
Depending on the requirements for archetyping, the residual archetype could be an empty set, $\bar{A} = \emptyset$ (the whole population is required to be covered with archetypes), or not (only the subset $A$ is relevant for archetyping). The archetypes $\{A_k\}$ were defined through imposing mutually exclusive sets of conditions on the building features in EPC data.
In Case 1 (building retrofitting), the archetyping process was mainly driven by expert knowledge. The resulting archetypes were therefore defined directly by conditioning on the usage type of the buildings, their age and their connection to the DH network (see Table 4), leaving the remaining part of the city beyond the scope of the study.
In Case 2 (electric heating), the archetyping was conducted in a trial-and-error manner, updating the segmentation scheme until an acceptable balance between complexity and applicability was found.
Splitting by three general building usage types $B3 = \{S, M, O\}$ was initially applied, to represent the difference in energy performance due to type of building use (S = single-family dwellings, M = multi-residential buildings, and O = other buildings (offices, schools, hospitals, etc.)). Further, various segmentations by types of heat sources were elaborated. Depending on the required level of detail and complexity, the heat sources (Table 1) were grouped together or kept separately, deriving a number of grouping schemes $Hn$, where $n \in \{3, 6, 8, 13\}$ is the number of types of heat sources used in the scheme (Table 2).
These grouping schemes served as the basis for generating spaces of all possible heat mixes, defined as the power set $P(Hn)$ over the heat source grouping scheme $Hn$. The maximum possible number of subsets (all possible combinations of the $n$ heat sources from the scheme $Hn$) is therefore $|P(Hn)| = 2^n - 1$ ($\emptyset$ is skipped as not applicable in this case).
For simplicity, hereafter we refer to any element of such sets of mixes by enumerating only the types of heat sources used ($QB_{source_j} = 1$). For example, for the grouping scheme $H3 = \{DH, FBH, EH\}$, the number of possible mixes is $|P(H3)| = 2^3 - 1 = 7$, and the buildings using both DH and EH belong to the subset $DH = 1, FBH = 0, EH = 1$ or, in short notation, DH, EH or DH+EH, which is an element of $P(H3)$.
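The space of heat mixes for a grouping scheme can be enumerated directly, which also confirms the count of 2^n − 1 non-empty combinations. This is an illustrative sketch; the function name is hypothetical and the schemes larger than H3 are represented only by their size, since Table 2 is not reproduced here.

```python
from itertools import combinations


def heat_mixes(scheme):
    """All non-empty combinations of heat sources in a grouping scheme."""
    sources = list(scheme)
    return [
        frozenset(combo)
        for size in range(1, len(sources) + 1)
        for combo in combinations(sources, size)
    ]


H3 = ("DH", "FBH", "EH")
mixes_h3 = heat_mixes(H3)

# Upper bounds on the number of mixes for the four scheme sizes in the text.
max_mixes = {n: len(heat_mixes(range(n))) for n in (3, 6, 8, 13)}
```

The upper bound grows quickly with the number of source types (7, 63, 255 and 8191 mixes for n = 3, 6, 8 and 13), which illustrates why the choice of grouping scheme controls the complexity of the segmentation.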
Two main groups of heat mixes involving EH were distinguished in $P(H3)$ to perform the detailed analysis of the amount of heat consumed and the associated power demand per combination of building types (B3, Table 3) and electric heat sources (H6, Table 2). The first group, "EH mainly" ($G_I = EH \setminus DH = \{EH\} \cup \{EH, FBH\}$), stands for the buildings using electricity, both solely and in mixture with fuel burners (FBH), but without district heating (DH). Thus group $G_I$ unites all buildings that are potentially harder targets to be switched from EH, as they are not connected to the DH system, and FBH is often not a better alternative to consider due to the higher local emissions. The second group, "EH partly" ($G_{II} = EH \cap DH = \{EH, DH\} \cup \{EH, DH, FBH\}$), unites all buildings using both electricity and district heating, whether with or without FBH. Hence, group $G_{II}$ unites all buildings that have good potential for switching, as they are already connected to the DH network, so little or no capital investment is associated with the switch and it can be achieved in a short time through proper economic incentives.
Finally, the resulting archetypes were defined through segmentation of buildings by the Cartesian product $B3 \times G2$ of the three usage types B3 and the two target groups of heat source mixes G2, where the B3 scheme covered all buildings, distinguished by usage types, and G2 corresponded only to the buildings using electricity for heating, distinguished by their connection to DH. Accordingly, all buildings not using electricity for heating ($\bar{A}$) were excluded from the study. For each archetype, the information regarding EH-based heat provision was split by the EH solutions of the H6 scheme. Therefore, in total there were $|B3 \times G2| = 3 \cdot 2 = 6$ archetypes, but with additional detail for four EH solutions (ER, GSHP, EAHP, ASHP). Hereafter, this scheme is referred to as B3G2H6.
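The resulting rule-based assignment can be sketched as follows. The labels `G_I` and `G_II` stand for the "EH mainly" and "EH partly" groups; the function names and the Boolean inputs are hypothetical.

```python
from itertools import product

B3 = ("S", "M", "O")   # single-family, multi-residential, other
G2 = ("G_I", "G_II")   # "EH mainly" (no DH) and "EH partly" (with DH)
archetypes = list(product(B3, G2))


def target_group(uses_eh, uses_dh):
    """Assign 'EH mainly' (G_I), 'EH partly' (G_II) or None (excluded)."""
    if not uses_eh:
        return None  # residual: buildings without electric heating
    return "G_II" if uses_dh else "G_I"


def archetype(usage_type, uses_eh, uses_dh):
    """Return the (usage type, target group) archetype, or None if excluded."""
    group = target_group(uses_eh, uses_dh)
    return None if group is None else (usage_type, group)
```

Note that fuel-based heating does not enter the rule: a building is in the same group with or without FBH, so only the use of EH and DH decides the assignment.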

Characterisation
Each archetype subset was analysed for outliers [62], which were removed in the event of low confidence ($p < 0.01$) for the heating energy use intensity ($EUI_{heat}$, kWh/(m²·yr)). This check targeted only those records in EPC data with a high probability of error, in order to ensure that there would be no significant impact from extreme outliers. Additional outlier removal was then conducted by applying the low confidence threshold ($p < 0.01$) to the energy signature parameters (balance point temperature $\theta$, base load $c$ and energy performance coefficient $\beta$) obtained in the modelling stage per archetype subset. This step allowed erroneous meter readings (usually caused by sensor faults) to be efficiently pinpointed and the impact of the respective statistical models on the aggregated archetype models to be eliminated.
In both cases, the archetypes obtained were characterised through a deterministic approach using the data from all buildings in each archetype subset.
In Case 1 (building retrofitting), the virtual archetype building for each of the target archetypes was constructed using weighted averaging of the corresponding archetype subset data by heated area $A_{temp}$ (Eq. (2)). The additional assumptions for the parameters not present in EPC data and required for building energy simulations (window-to-wall ratio, U-values, standardised use) were made according to the reference data.
In Case 2 (electric heating), archetype subsets were described by providing average and total values for the relevant parameters (number of buildings, heated area $A_{temp}$, absolute and specific energy, and peak power).
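Averaging weighted by heated area can be illustrated with a short sketch. The averaging equation is not reproduced in this text, so the standard weighted mean below is an assumption about its form, and the feature values are hypothetical.

```python
def area_weighted_mean(values, areas):
    """Average of a building feature weighted by heated area A_temp."""
    total_area = sum(areas)
    return sum(v * a for v, a in zip(values, areas)) / total_area


# Hypothetical archetype subset: number of floors and heated area (m2) per building.
floors = [3, 5, 8]
a_temp = [1_000.0, 2_000.0, 1_000.0]
virtual_floors = area_weighted_mean(floors, a_temp)
```

Weighting by area means that large buildings influence the virtual archetype building more than small ones, in contrast to a plain per-building average.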
In addition, the heated area $A_{temp}$ values reported in EPC data were split proportionally to the amount of heat per source of heating $j$ for each building $i$: $A_{i,source_j} = A_{temp,i} \cdot Q_{i,source_j} / \sum_j Q_{i,source_j}$, so that $\sum_j A_{i,source_j} = A_{temp,i}$. The virtual areas corresponding to the share of area heated with a particular source $j$ ($A_{i,source_j}$) were further used to estimate heating power demand per source through the energy signature models obtained. Thus, for buildings using more than one type of source for heating, the power demand estimation was based on scaling the energy signature output proportionally to the share of this source in the reported total heat energy use. The peak heating power demand of each building type (scheme B3) was then estimated by applying the energy signature models (1) at a temperature of −21 °C. Finally, reference average values for each type of electric heating solution involved and for the average building in each archetype were obtained by dividing the corresponding totals by the virtual heated areas and the number of buildings, respectively.
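The proportional split of heated area and the peak power estimate can be sketched together. The piecewise-linear signature form, the parameter values and the units (W/m2 for the specific load) are assumptions made for illustration; only the design temperature of −21 °C is taken from the text.

```python
def split_area_by_source(a_temp, heat_by_source):
    """Allocate heated area to each source in proportion to its share of heat."""
    total_heat = sum(heat_by_source.values())
    return {s: a_temp * q / total_heat for s, q in heat_by_source.items()}


def peak_heat_power(area, theta, c, beta, design_temp=-21.0):
    """Peak heat power for an area, using an assumed piecewise-linear signature."""
    specific_load = c + beta * max(theta - design_temp, 0.0)  # per m2
    return specific_load * area


# Hypothetical building: 2000 m2, heated 75% by DH and 25% by EH (kWh/year).
areas = split_area_by_source(2_000.0, {"DH": 150_000.0, "EH": 50_000.0})
eh_peak_w = peak_heat_power(areas["EH"], theta=15.0, c=5.0, beta=1.0)
```

In this example the electric source is assigned 500 m2 of virtual area, so only a quarter of the building's signature output is attributed to electric heating at the design temperature.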

Building energy simulations
Building energy simulations of virtual archetype buildings were used to estimate the effect of various energy conservation measures on the energy performance of the buildings analysed. First, a building energy simulation model was created for each building archetype in the computer simulation tool DesignBuilder. The model was set up according to the characterisation obtained (Section 3.5). The simulation model obtained was then calibrated against the energy signature for the archetype building $ES_{archetype_i}$, which was derived from the buildings of archetype $i$ using (2).
Second, building retrofitting packages (one or several retrofitting measures) that are applicable for buildings of each archetype analysed were identified.
Third, the virtual archetype building models obtained were used as baseline to develop derivative models simulating the updated building energy performance due to implementation of one of the selected retrofitting packages.
Finally, the new energy signatures for each virtual archetype building in each retrofitting package were used to perform a scale-back, to update the energy signature for each building in the subset $archetype_i$, as exemplified for the balance point temperature parameter in Eq. (4), where $\theta_{archetype_i}$ and $\theta_{building}$ are, respectively, the baseline balance point temperatures of the virtual archetype building and the modelled building, and $\theta^*_{archetype_i}$ and $\theta^*_{building}$ are, respectively, the estimated balance point temperatures of the virtual archetype building and the modelled building after implementation of the retrofitting package.
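The scale-back equation is not reproduced in this text. One plausible reading, shown here purely as an assumption, is a proportional scaling in which the relative change simulated for the virtual archetype building is applied to each building's own parameter. The function name and temperatures below are hypothetical.

```python
def scale_back(param_building, param_archetype, param_archetype_retrofit):
    """Assumed proportional scale-back of one energy signature parameter.

    The building's baseline parameter is multiplied by the ratio of the
    archetype's post-retrofit value to its baseline value.
    """
    return param_building * param_archetype_retrofit / param_archetype


# Hypothetical balance point temperatures (degC): the archetype drops from 15 to 12.
theta_building_new = scale_back(
    param_building=16.0, param_archetype=15.0, param_archetype_retrofit=12.0
)
```

Under this assumption a building keeps its individual offset from the archetype in relative terms, so a building with a higher baseline balance point also ends up with a higher post-retrofit value.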

Aggregation
In Case 1 (building retrofitting), the aggregation involved two steps: calculating the specific savings per building for each retrofitting package as the difference between the specific heat demand at the baseline and after the particular retrofitting package was applied, and calculating the total savings per package and archetype by direct summation of the absolute savings of all buildings in the archetype set.
In Case 2 (electric heating), the number of buildings, the total heated area $A_{temp}$ and the total heat energy demand per archetype in the segmentation schemes considered (B3, $P(H3)$, $P(H6)$ and B3G2H6) were calculated in order to choose the segmentation scheme to be applied. Afterwards, electrical energy and power demand aggregations were obtained through inverse application of the SPFs for each type of heat source involved (Table 1). Both energy and power estimates were projected into the corresponding total annual demand for electrical energy and peak power demand.
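The Case 1 aggregation can be sketched as below. Converting specific savings to absolute savings by multiplying with heated area is an assumption about how the absolute savings were obtained; the field names and values are hypothetical.

```python
def specific_saving(q_baseline, q_retrofit):
    """Specific saving per building (kWh/m2 per year): baseline minus retrofit."""
    return q_baseline - q_retrofit


def total_saving(buildings):
    """Sum absolute savings (kWh/year) over all buildings in an archetype subset."""
    return sum(
        specific_saving(b["q_baseline"], b["q_retrofit"]) * b["a_temp"]
        for b in buildings
    )


# Hypothetical archetype subset for one retrofitting package.
subset = [
    {"q_baseline": 140.0, "q_retrofit": 110.0, "a_temp": 1_000.0},
    {"q_baseline": 120.0, "q_retrofit": 100.0, "a_temp": 2_500.0},
]
saving_kwh = total_saving(subset)
```

Summing per building rather than scaling a single archetype value keeps the variation within the subset in the aggregated result.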
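The inverse application of the SPFs for Case 2 amounts to dividing heat by the seasonal performance factor of each source. In the sketch below the GSHP and EAHP factors are those quoted in Section 3.3, whereas the factor of 1.0 for electric radiators (ER) and the heat totals are assumptions, since Table 1 is not reproduced here.

```python
def electricity_from_heat(heat_by_source, spf_by_source):
    """Electrical energy per source from heat: E = Q / eta."""
    return {s: q / spf_by_source[s] for s, q in heat_by_source.items()}


# SPFs: GSHP and EAHP as quoted in the text; ER = 1.0 is an assumption.
spf = {"GSHP": 3.1, "EAHP": 2.6, "ER": 1.0}

# Hypothetical archetype totals of heat per electric source (MWh/year).
heat = {"GSHP": 310.0, "EAHP": 260.0, "ER": 50.0}
electricity = electricity_from_heat(heat, spf)
total_electricity = sum(electricity.values())
```

The example shows why heat pumps and direct electric heating have to be kept apart: the same amount of heat corresponds to very different amounts of electricity.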

Results
The data-driven building archetypes approach was tested on the building stock of Stockholm for the two cases described in Section 2.2. Statistical modelling of the heating demand from the FDH data was required for both cases, and is therefore presented separately. The subsequent workflow differed between the two cases, but it was based on the same EPC data, comprising 30 472 energy declarations as a representative sample for the whole city building stock.

Statistical modelling of heat demand
The whole DH network of Stockholm Exergi is represented by 15 068 DH metering points. After filtering by the scope of Stockholm municipality, 12 938 metering points were left. The hourly meterings of the heat energy consumption at these metering points were used to model the heat energy demand, as described in Section 3.2. Finally, after performing the linkage of EPC and FDH data and dropping the outliers, 6732 metering points were left, and were used for further analysis in both cases. The accuracy of the energy signature models developed was assessed using the quality indices depicted in Fig. 4.
As can be seen from Fig. 4, the majority of the energy signature models obtained had an acceptable magnitude of error. For example, under ASHRAE Guideline 14 [63], MASE, NRMSE and $R^2$ should meet the criteria ±10%, <30% and >0.75, respectively. Hence, it was reasonable to use these models to calibrate and estimate the specific heat demand of archetype buildings constructed in both cases. 7

Case 1. Building retrofitting
Three building archetypes were developed for the building retrofitting case (Table 4) as described in Section 3.4. 8 Archetype $A_1$ represents multi-residential buildings constructed in 1946–1975, corresponding to the two consecutive periods of developing affordable housing in Sweden: Folkhemsbygget (1946–1960) and Miljonprogrammet (1961–1975). This is therefore the most widespread type of building not only in Stockholm, but generally in Sweden. Archetype $A_2$ represents all office buildings and was chosen because it is another widespread type of building, having a noticeably different type of activities compared with $A_1$ buildings. Archetype $A_3$ represents recently (1996–…) built multi-residential buildings, and was chosen as it reflects the state-of-the-art of Swedish building construction. New buildings are often used to benchmark the energy performance of older buildings. At the same time, the potential for energy saving in new buildings can indicate the efficiency of the current building code and further need for improvement. Finally, all three archetypes were limited to buildings using DH as their main source of heat, to use the energy signatures obtained in Section 4.1. The statistical distribution of the features further used for construction of the virtual archetype buildings is depicted in Fig. 5. A detailed summary is provided in Appendix A (Table).
Virtual archetype buildings were created in DesignBuilder from the reference data and the aggregated features of the buildings characterised from each archetype subset according to (2).
After the baseline simulation models were set up, they were calibrated using the aggregated energy signature of each archetype (r0 in Table 5) obtained through averaging the energy signature model parameters (2). The calibrated dynamic simulation models of the virtual archetype buildings were then updated to model different retrofitting packages selected for analysis for each archetype (Table 6). The updated energy signatures of the virtual archetype buildings (Table 5) were used to project the modified energy performance of each building in the different archetype subsets for each retrofitting package analysed, applying the scaling technique (4). Fig. 6 depicts the estimates of energy savings obtained for various building energy retrofits, aggregated per archetype subset and retrofitting package applied.

7 Boxes represent the 25th percentile, median estimate and 75th percentile. Whiskers represent minimums and maximums within 1.5 IQR (inter-quartile range). For more details, see e.g. Ref. [74].
8 Hereafter, metering points are limited to those used for calibration purposes, with a verified linkage with the buildings analysed and providing heat to buildings of only one archetype.
As can be observed from Fig. 6a, the specific savings for a particular retrofitting package varied depending on the archetype to which it was applied (e.g. heat recovery ventilation in A1 and A3). In the case of combined packages (r101-r103), the total saving effect was sometimes noticeably less than the sum of the savings from the same measures applied individually, which confirmed previous findings [64]. The total absolute savings for buildings in the three archetypes (Fig. 6b) were estimated at 738 GWh/year.

Case 2. Electric heating
The segmentation was conducted in two stages. First, the building stock was split into three non-overlapping groups of buildings (hereafter B3): S = single-family dwellings, M = multi-residential buildings, and O = other buildings (offices, schools, hospitals, etc.), as illustrated in Table 3. This stage of segmentation was performed using the tax agency (Skatteverket) building categories (see Table A2 in Appendix A for full details). This segmentation scheme was chosen as it is well established and used by many researchers (e.g. Ref. [65]) and practitioners (e.g. Ref. [66]), and as it provided direct consistency with other energy-related studies performed by the City of Stockholm. 9 Second, the building stock was further split by heat source type. The initial EPC data contained information about the heat sources distinguished into 13 types, and therefore several grouping schemes were applied in the further analysis depending on the question addressed (Table 2). The H13 scheme was simplified to an H8 scheme by combining all electric radiators, air-source heat pumps and biofuel-based heaters into three overarching groups. The H3 scheme was then set up as the most general, providing a split of the heat sources into three main types (DH = district heating, FBH = fossil- and biofuel-based heating, and EH = electric heating), which provided a general overview of the building stock by total amount of heat and the P(H3) mixes of heat sources used (Fig. 7a). However, further application of the P(H8) scheme revealed an overwhelmingly large variety (n = 77) of mixes of heat sources. Therefore, to reduce this complexity, the H6 scheme was constructed and further applied to analyse the most typical mixes of heat sources (Fig. 7b). It provided a reasonable compromise for analysis of mixes by keeping more details for the EH heating solutions that were the focus of this study, while all other types of heat sources were simplified to DH and fuel-based burners (FBH).
The total number of archetypes that would derive from using P(H3) or P(H6) in the whole segmentation would be too large; in the case of P(H3) the upper limit on the number of resulting archetypes was $|B3 \times P(H3)| = 3 \cdot (2^3 - 1) = 3 \cdot 7 = 21$ archetypes, and in the case of P(H6) it was accordingly $|B3 \times P(H6)| = 3 \cdot (2^6 - 1) = 3 \cdot 63 = 189$ archetypes. While such microdivisioning could still be relevant to shape policies and incentives targeting very particular narrow segments, it was clearly not practical from the perspective of this study.
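The upper limits quoted above can be checked by enumerating the non-empty mixes of heat sources directly. The sketch below is illustrative only; the function names are hypothetical, and the source labels follow the H3 and H6 schemes described in the text.

```python
from itertools import combinations

def nonempty_mixes(sources):
    """All non-empty combinations of heat sources (power set minus the empty set)."""
    return [frozenset(combo)
            for size in range(1, len(sources) + 1)
            for combo in combinations(sources, size)]

def archetype_upper_bound(building_types, sources):
    """Upper limit on archetypes: building types times non-empty source mixes."""
    return len(building_types) * len(nonempty_mixes(sources))

B3 = ["S", "M", "O"]
H3 = ["DH", "FBH", "EH"]
H6 = ["DH", "FBH", "ER", "GSHP", "EAHP", "ASHP"]

print(archetype_upper_bound(B3, H3))  # 3 * (2**3 - 1) = 21
print(archetype_upper_bound(B3, H6))  # 3 * (2**6 - 1) = 189
```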
As can be seen from the structure of the heating by heat sources (Fig. 7), DH plays a dominant role in the provision of heating in Stockholm, being associated with ~85% of the total city heat demand. Even though electric heating represents only 10.5% of the heat demand, city-wide this is still a considerable amount of energy (721 GWh of heat or 383 GWh of electricity), which is currently subject to further growth due to further efficiency improvement of heat pumps and the relatively low price of electricity. The graphic illustration of the mixes also highlighted the marked difference in the role of EH-based heat sources for sole EH usage (one or several EH-based solutions) or mixed EH usage, when EH solutions are used in combination with DH. Therefore, two main target groups, G_I ("EH mainly") and G_II ("EH partly"), were identified from the space of mixes of types of heating sources P(H3). The whole segmentation scheme was then constructed as a Cartesian product of three building types (Table 3) and these two groups as described in Section 3.4, resulting in the scheme B3G2H6. The counts of buildings per archetype/EH solution are shown in Table A3 in Appendix A 10 .

[Table 6, row fragment] A3: IT21 + EEW + HRV (adapt indoor temperature + energy-efficient windows + heat recovery ventilation)

9 Hereafter, metering points are limited to those used for characterisation purposes, which had a verified linkage with the buildings analysed and provided heat to buildings of only one archetype.
To better understand the character of the load covered by separate types of EH heat sources, shares of particular solutions Q_source/Q_total were calculated for each building per archetype (Fig. 8). This graph proved to be particularly useful in representing the technological differences between the solutions. The width of the interquartile range (IQR) can be treated as a characteristic of the diversity of the installed systems, ranging from rather narrow ranges of 5–20% width (e.g. GSHP in G_I or ER in G_II) to basically universal ranges of more than 60% for such general sets of heat sources as FBH or EH. Comparing the distributions of shares group-wise further demonstrated the significant difference between the two target groups, e.g. there was a clear distinction for the modes using GSHP depending on the availability of DH (G_I vs. G_II). Furthermore, this allowed the systems used to be classified by the type of load addressed, e.g. GSHP in G_I corresponds to the base load, ASHP in both G_I and G_II can be attributed to intermediate load, and ER in G_II corresponds to the peak load. While these observations are not of significant novelty as information about these general types of heat sources, they can clearly be useful for justification of modelling assumptions in energy studies on a large scale (e.g. district or urban).
Following the B3G2H6 scheme, the total heat energy demand per archetype was calculated with further disaggregation by types of EH solutions (Table 7).
Applying the respective SPF coefficients (Table 1) in reverse to the annual heat values (Table 7), i.e. converting heat delivered back into electricity consumed, gave estimates of the annual amounts of electrical energy consumed for heating purposes (Table 8).
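As a minimal sketch of this conversion, the electricity consumed can be approximated as the heat delivered divided by the seasonal performance factor (SPF) of each solution. The SPF values below are placeholders chosen for illustration and are not the coefficients of Table 1; only the heat values are taken from Table 7 (single-family buildings, group "EH mainly").

```python
# Hypothetical SPF values for illustration only; NOT the coefficients of Table 1.
ASSUMED_SPF = {"ER": 1.0, "GSHP": 3.5, "EAHP": 2.3, "ASHP": 2.6}

def electricity_from_heat(heat_gwh, spf):
    """Convert annual heat per source (GWh_th) to electricity (GWh_el) by dividing by SPF."""
    return {source: heat / spf[source] for source, heat in heat_gwh.items()}

# Annual heat per EH solution for single-family buildings, "EH mainly" (Table 7), GWh_th.
heat_s_mainly = {"ER": 80, "GSHP": 137, "EAHP": 16, "ASHP": 63}

electricity_s_mainly = electricity_from_heat(heat_s_mainly, ASSUMED_SPF)
for source, value in electricity_s_mainly.items():
    print(f"{source}: {value:.1f} GWh_el")
```

Electric radiators convert electricity to heat directly, so an SPF of 1 leaves their value unchanged, whereas every heat pump type consumes less electricity than the heat it delivers.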
As follows from Table 8, the electricity demand for the building stock is 81% for G_I and 19% for G_II, where large buildings (multi-residential and other) constitute the majority in the group of 'potential switchers' (G_II, EH partly). It can also be seen that electric radiators (ER), which usually require relatively low capital investments, are responsible for more than half of the total energy demand.

Fig. 7. Simplified (a) and detailed (b) structure of the heat demand in the Stockholm building stock by combinations of heat sources according to EPC data. Colours correspond to scheme H3. Buildings using district heating (blue) are provided with the measured FDH data. Buildings using electricity for heating (red) are the focus of this study. Buildings using fossil and biofuels are marked in brown.
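The group shares and the electric radiator share quoted from Table 8 can be re-derived from the TOTAL row of that table. The snippet below is only an arithmetic check; the dictionary layout is an illustrative choice.

```python
# TOTAL row of Table 8 (GWh_el), split by target group and EH solution.
table8_total = {
    "EH mainly": {"ER": 177, "GSHP": 89, "EAHP": 13, "ASHP": 32},
    "EH partly": {"ER": 26, "GSHP": 11, "EAHP": 28, "ASHP": 7},
}

group_sums = {group: sum(values.values()) for group, values in table8_total.items()}
grand_total = sum(group_sums.values())

# Share of each target group in the total electricity demand for heating.
group_shares = {group: total / grand_total for group, total in group_sums.items()}

# Share of electric radiators across both groups.
er_share = sum(values["ER"] for values in table8_total.values()) / grand_total

for group, share in group_shares.items():
    print(f"{group}: {share:.0%}")
print(f"Electric radiators: {er_share:.0%}")
```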
To obtain the estimates for peak power demand, the energy signature models of buildings using DH only (DH \ (EH ∪ FBH)) were used (Section 4.1). They were subset and aggregated per building type in B3 (S = single-family, M = multi-residential, O = other) following (2). The parameters for the resulting energy signatures per building type are given in Table 9.
Virtual areas corresponding to the share of area heated with a particular source were then used to estimate the heat power demand per heating solution in each building according to (3). A summary of the 'virtual areas' is provided in Table A4 in Appendix A.
Applying the building type energy signature models per 'virtual area' of each building allowed the peak heat and respective electrical power demand of the buildings associated with EH to be estimated. The estimated values and structure of the peak electrical power demand (required to provide electric heating at −21 °C) are shown in Table 10 and Fig. 9, respectively.
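The calculation can be pictured with the following sketch, which is not the authors' model: it assumes a piecewise-linear energy signature per unit area (a base load plus a temperature-dependent part below a balance temperature) and a single heat-to-electricity conversion factor per heating solution. All parameter values and function names are hypothetical; the actual parameters are those given by (2), (3) and Table 9.

```python
def signature_heat_w_per_m2(t_out_c, base_w_m2, slope_w_m2k, t_balance_c):
    """Assumed piecewise-linear energy signature: specific heat power (W/m2) at outdoor temperature t_out_c."""
    return base_w_m2 + slope_w_m2k * max(0.0, t_balance_c - t_out_c)

def peak_electric_kw(virtual_area_m2, conversion_factor, t_design_c=-21.0,
                     base_w_m2=5.0, slope_w_m2k=1.0, t_balance_c=15.0):
    """Peak electric power (kW) for the virtual area heated by one solution.

    conversion_factor is the assumed heat-to-electricity ratio at design
    conditions (1.0 for electric radiators, above 1.0 for heat pumps).
    """
    specific_heat = signature_heat_w_per_m2(t_design_c, base_w_m2, slope_w_m2k, t_balance_c)
    heat_kw = virtual_area_m2 * specific_heat / 1000.0
    return heat_kw / conversion_factor

# Hypothetical building: 1000 m2 of virtual area heated by each solution.
radiator_kw = peak_electric_kw(1000.0, conversion_factor=1.0)
heat_pump_kw = peak_electric_kw(1000.0, conversion_factor=2.0)
print(radiator_kw, heat_pump_kw)
```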
Finally, the counts obtained for required electricity demand and the estimates for peak electric power required for electric heating were used to provide average peak electric power per building (Table 11), average peak electric power per area (Table 12) and average annual specific energy demand (Table 13).
The analysis allowed the lower margin for peak electric power demand for electric heating to be estimated as 147 MW. Of this, 119 MW corresponded to G_I (EH mainly) and 35 MW to G_II (EH partly), i.e. the group of buildings connected to the DH network, as confirmed by non-zero consumption of heat attributed to DH. While this power is clearly not enough to compensate for the expected deficit in power capacity (Fig. 1), it has the potential to provide additional flexibility to the urban energy system of Stockholm.

Discussion
The exponential growth in the availability and quality of building energy data, along with the continuous development of computational infrastructures and analytical methods, sets a welcoming environment for further expansion of UBEM-based tailor-made solutions to address various urban energy challenges, utilising as much information as possible for each problem context. This drives further improvements in the quality of the models developed, and, at the same time, decreases tolerance to simplifications implied by the universal archetyping schemes. The results obtained in this study suggest that any urban building energy modelling exercise should recognise the need for alignment of the chosen archetyping scheme with the problem to be solved, data provided, methods used and project resources available.
Fig. 8. Shares of particular solutions for heating in the total heat supply Q_source/Q_total of each building, segmented by building types (scheme B3: S = single-family, M = multi-residential, O = other) and two target groups (scheme G2: G_I = EH mainly and G_II = EH partly). Heating solutions are structured by the scheme H6 (DH = district heating, FBH = fossil- and biofuel-based, EH = electric, split into ER = electric radiators, GSHP = ground-source heat pumps, EAHP = exhaust air-source heat pumps, and ASHP = air-source heat pumps).

Table 7. Annual electric heat (EH) consumption of buildings segmented by archetype schemes B3EH (total EH), B3G2 (groups G_I and G_II), B3H6 (by each type of EH solution in H6), and total, GWh_th.

                 G_I (EH mainly)                   G_II (EH partly)
B3      TOTAL    Sum   ER    GSHP  EAHP  ASHP      Sum   ER    GSHP  EAHP  ASHP
S       296      295   80    137   16    63        1     1     0     0     0
M       251      169   38    113   13    6         82    13    20    47    2
O       173      117   59    41    1     17        56    13    14    15    15
TOTAL   721      582   177   291   29    85        139   26    34    61    17

Table 8. Annual electrical energy consumption associated with electric heat (EH) in buildings segmented by the archetyping scheme B3G2H6, GWh_el.

                 G_I (EH mainly)                   G_II (EH partly)
B3      TOTAL    Sum   ER    GSHP  EAHP  ASHP      Sum   ER    GSHP  EAHP  ASHP
S       151      150   80    39    7     24        1     1     0     0     0
M       124      82    38    36    6     2         41    13    7     21    1
O       109      79    59    13    0     6         29    13    4     7     6
TOTAL   383      312   177   89    13    32        72    26    11    28    7

Fig. 9. Structure of the peak electrical power demand (corresponding to the electric heating at −21 °C) according to the archetyping scheme B3G2H6. More saturated segments above correspond to G_I (EH mainly), less saturated segments below to G_II (EH partly).

Until recently, archetypes were mainly developed as universal reference input data for individual building energy models (BEMs) and the earliest UBEMs [26,67]. However, the topicality of such building typologies is decreasing with the growing availability of data and the recent developments in UBEM [7,9]. As more rich datasets emerge, the issue of similarity from a building energy performance perspective is gaining more importance, as it allows data-driven estimation of building energy performance to be performed for different building stocks analysed using individual building-centric clusters of similar buildings [24,25]. This would eliminate the subjectivity of modellers and increase the scalability and transferability of the results obtained. Furthermore, on-demand segmentation would provide the possibility to focus on those aspects of building energy performance that are relevant for a particular problem, improving the efficiency of the archetypes derived from the perspective of complexity and interpretability of modelling outcomes. The developments in this study allowed the following criteria for building archetypes to be formulated:

Quality: amount of error induced by the archetype-caused simplification.
Computational complexity: structural and computing complexity associated with the archetypes developed.
Intensity: redundancy for the archetyping scheme given the requirements on the levels of quality and/or detail.
Expandability: flexibility in regulating the level of detail for the archetyping scheme.
Data demand: amount and type of input data required.
Data robustness: sensitivity to data anomalies.
Interpretability: understanding for energy performance of the building stock.
Representativeness: representation for all aspects of the problem addressed.
Transferability: possibility of applying the archetypes developed in other cases.
Completeness: coverage of the building stock.
Customisability: possibility of user extension of archetypes depending on the problem addressed.
Many of the above-mentioned criteria were relevant for both cases in this study. In particular, in terms of data demand, spatial or building geometry data were not needed as input in either case. It was more important to monitor the quality of archetyping in Case 1 (building retrofitting), while this was not recognised as a problem for Case 2 (electric heating), in which the archetyping process added only a limited amount of uncertainty to the data. With regard to expandability, flexibility in regulating the level of detail for the archetyping scheme was not possible for archetypes in Case 1, while Case 2 had this feature by design. Conversely, transferability was not feasible for the results of archetyping in Case 2, while in Case 1 the results of archetyping could definitely be transferred to other Swedish cities. More importantly, the archetyping approaches used in both cases are transferable to any other Swedish city.
In both cases, the chosen segmentation (categories and their granularity) was the result of a compromise between the level of detail and generalisation considered appropriate for the problem analysis, as demonstrated in previous studies [12,22].
In Case 1 (building retrofitting), the archetyping aimed to obtain archetype subsets that would be not only sufficiently large, but also sufficiently homogeneous to ensure feasibility of the proposed retrofitting measures and to allow for justification of assumptions about typical activities and related internal gains from the reference data. Furthermore, the set of buildings analysed was limited only to those connected to the district heating network, as this allowed building energy performance of all buildings analysed to be statistically modelled using the measured data on actual heat energy use.
In Case 2 (electric heating), the segmentation aimed to cover the whole city building stock and was more general, as further splits could be performed based on the source of heat. The particular segmentation schemes were used according to the needs of the Stockholm City energy planning unit and were largely required to preserve compliance with previous and adjacent studies. At the same time, the approach presented here allows full flexibility in setting up any other segmentation scheme that can be expressed through conditioning of features present in energy declarations, allowing instant update of all results obtained.
Limitations of this study concern the uncertainty embedded in the input data, statistical models and approaches used for characterisation and peak power estimation. The EPC data used as core input data for both cases have proven to be an important source of information about building stock, especially in Sweden. Though the data contain some errors [48,68], to our knowledge they are the best available data source on building energy on the urban scale in Sweden. However, there are two peculiarities of this data source that affected the results in Case 2. First, the national regulations for buildings to have an energy audit are not as demanding in the case of single-family houses, which resulted in significant underrepresentation of this segment in the energy and power calculations performed. According to the national statistics agency, Stockholm has 42 867 single-family buildings [69], but only 13 123 were represented in the input data. Second, the energy audits, which are the only input for the EPC register, are required to be conducted once a decade, and therefore a noticeable temporal lag is present in the results for Case 2. For example, according to EPC data, the recently built environmental district Hammarby Sjöstad has 59 housing associations with only 1 GSHP, 2 EAHPs and 0 ASHPs installed, while in reality there are confirmed installations of 7 GSHPs, 19 EAHPs and 16 ASHPs [70]. This is in line with the general trend of increased penetration of heat pump technology [2].
Regarding the statistical models, it should be noted that, even though energy signature models embed end uses and thermal mass of buildings [4], they do not allow these effects, which obviously are different per building analysed, to be isolated and estimated. Besides, some particular deviations related to peak frosts and base summer loads could be observed for individual buildings, which corresponds to previous observations [12].
As for characterisation, even though a deterministic approach is known to be less precise than a probabilistic approach [19], it was assumed to be sufficient due to: a) further calibration and scaling of models based on energy signatures in Case 1; and b) the purely exploratory and compression function of archetypes in Case 2, which clearly could be characterised probabilistically if further required as a model input.
The results obtained in Case 1 (building retrofitting) do not account for the rebound effect and energy performance gap that has been observed after introducing retrofitting packages [31]. To obtain a data-driven estimate of this phenomenon, an additional validation study based on measured data for buildings undergoing retrofitting projects could be performed [71].
The estimation of peak power in Case 2 (electric heating) was based on the assumption that buildings using EH have on average the same heat load profiles as those connected to DH, which allowed the statistical models of heat load profiles derived from FDH data to be applied for different building usage types (scheme B3: S = single-family, M = multi-residential, O = other). To reduce this uncertainty, a verification study could be performed using high-resolution metering data on building electricity usage structured by types of heating solutions. However, neither the applied nor the proposed approach would account for the use of heating appliances embedded in regular household loads.
It is worth noting that the cooling demand was beyond the scope of the study. Stockholm has a developed district cooling system [65] and the cooling demand can be modelled with energy signatures fitted by actual energy use in the same way it was done for heat. However, the possible double use of heat pumps and the low granularity of the information about cooling sources in EPC data made it scarcely possible to estimate the cooling demand covered by electricity.
As can be observed throughout this paper, mean values and sample distributions remain the main form of aggregating building stock information for the purpose of simplifying the models. However, median and IQR should also be considered for representation purposes in the case of skewed distributions, as they are less sensitive to outliers. Therefore Tukey boxplots were used in a number of diagrams in this paper.
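The difference between the two kinds of summary can be illustrated with a small sketch using hypothetical values: a single outlier shifts the mean strongly, while the median, IQR and Tukey whiskers (the most extreme points within 1.5 IQR of the quartiles) are unaffected. The quartile method is an assumption; plotting libraries may use slightly different conventions.

```python
import statistics

def tukey_summary(values):
    """Quartiles, Tukey whiskers (within 1.5 IQR of the box) and mean of a sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "mean": statistics.fmean(values),
    }

# Hypothetical skewed sample: four typical buildings and one outlier.
sample = [1, 2, 3, 4, 100]
summary = tukey_summary(sample)
print(summary["median"], summary["iqr"], summary["mean"])
```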
The spatial perspective was beyond the scope of this paper. However, it is another dimension of analysis that is relevant for energy planners, especially for Case 2. To explore the potential synergies between the local power distribution grid and the DH pipe network, their topologies would need to be linked. This would allow existing surpluses and deficits of capacity to be mapped against the EH-related loads identified in this study, and an action plan for local stakeholders to be developed.
Sweden already has precedent, as various policies have been introduced to stimulate conversion to energy sources for space heating other than electricity [72]. This study did not seek to identify any solution for the power capacity problem. Instead, it aimed to highlight the additional potential for flexibility embedded in electric heating that could be utilised through more intelligent orchestration of energy in cities involving technologies and policies developed jointly by government, grid operators, energy utilities and cities.

Conclusions
The study demonstrated how using rich datasets allows different building archetypes to be developed, depending on the urban energy challenges addressed. Two cases were analysed for Stockholm based on energy performance certificates (EPC) and measured data. In the case of building energy retrofitting, three building archetypes (multi-residential buildings 1946–1975, offices and multi-residential buildings 1996–…) were developed, which allowed energy savings for seven retrofitting packages to be estimated. In the case of electric heating (EH), six archetypes were developed as a combination of three usage types (single-family houses, multi-residential buildings and other) and two groups of buildings using EH by heat sources (using DH and not using DH), with details for four main types of EH solutions (electric radiators, ground-, exhaust air- and air-source heat pumps). The identified lower bound for peak electric power for EH (147 MW) could be addressed to improve power capacity of the local electric grid.
The noticeable improvement in UBEM workflows demonstrated in this paper derives from the use of the EPC data linked with measured data, accompanied by data quality control. It allowed statistical models to be fitted prior to the segmentation, independent of the choice of archetype segmentation scheme, and two different urban energy challenges to be explored using the same input data, as demonstrated with the two cases analysed. The growing availability of data greatly decreases the number of assumptions required about building parameters in analysis or simulations, which can now rely largely on actual information about individual buildings. However, this imposes new challenges related to data quality and consistency, along with the need for more intelligent automatic segmentation to provide computational scalability of the constructed UBEMs.
The cases for building retrofitting and electric heating analysed here exemplify how the emergence of linked building energy data gradually changes the paradigm of UBEM itself, driving further progression from stand-alone energy datasets with single-logic models to urban data lakes fuelling on-request multiple-purpose data intelligence services.