Global Building Morphology Indicators

Characterising and analysing urban morphology is a continuous task in urban data science, environmental analyses, and many other domains. As the availability and quality of data on them have been increasing, buildings have gained more attention. However, tools and data facilitating large-scale studies, together with an interdisciplinary consensus on metrics, remain scarce and often inadequate. We present Global Building Morphology Indicators (GBMI), a three-pronged contribution addressing such shortcomings: (i) a comprehensive list of hundreds of multi-scale building form measures derived through a systematic literature review; (ii) a methodology and tool for computing these metrics in a database suited for big data and comparative studies, with the code released as free and open-source software; and (iii) a public repository of data quantifying the form of selected urban areas around the world, generated using high performance computing, whose value we demonstrate with novel analyses comparing morphological parameters across cities. GBMI introduces a formalised, structured, modular, and extensible method to compute, manage, and disseminate urban indicators at a large scale and high resolution, while the precomputed dataset facilitates comparative studies. The theory and implementation traverse multiple scales: the building level, covering both individual indicators and contextual ones based on encircling buildings with multiple buffers, and aggregations at several hierarchical administrative levels and at multiple grids. Our open dataset, comprising billions of records on a growing scope of urban areas worldwide, is the most comprehensive instance of morphological data parametrising the individual building stock, supporting studies in urban analytics and a range of disciplines.

We identify three major research gaps in this domain, which we tackle in this cohesive and integrated paper. First, even though great strides have been made in the field and there is a mountain of research papers focused on parametrising buildings under the umbrella of urban form, to the best of our knowledge, there is no review or inventory of such indicators, nor a consensus on them. In this paper, we conduct a systematic review to identify commonly used indicators and shape a framework, and present the results together with observations from the review. With scalability and multidisciplinarity in mind, we focus on indicators that can be computed from widely available data (i.e. building footprints) and that do not require complex, domain-specific simulation software that does not scale. While we encounter a heterogeneous landscape of metrics and terminological disparities, we identify common patterns and derive a list of indicators that may serve a kaleidoscopic set of topics, and introduce new derivative parameters, furthering possibilities in this domain.
Second, tools to compute such building form metrics are scarce, and they may not be suited to handle large-scale needs. As the scale of morphological analyses, in particular on street networks, is growing and sometimes including nation-wide studies and beyond (Boeing, 2020; Liu, Chen, Li, & Chen, 2020; Zhou, Lin, & Bao, 2021), we believe that work supporting wide-ranging analyses using building data is necessary and timely. Considering the growth of building data, e.g. open government data on buildings and large-scale data released by companies (Heris, Foks, Bagstad, Troy, & Ancona, 2020; Sirko et al., 2021) and academia (Zhang et al., 2022), studies using buildings will only grow in importance in the future. Our paper presents the development of a tool to implement the computation of hundreds of indicators structured following the review and the creation of a large-scale database, which is suited for big data management and analyses, as we demonstrate with several examples. Implementing the work as a database rather than as a desktop tool, as in most related work (Section 2), has many advantages that will be elaborated later. The developed code is free and open-source, extensible (e.g. allowing the addition of new indicators if necessary), customisable (e.g. supporting different forms of input and output data), and flexible (e.g. it can be used to compute the indicators anywhere at nearly any scale and resolution).
Third, while the computation of such indicators may be laborious, consuming the time of researchers, especially in studies that increasingly include multiple locations (Kraff, Wurm, & Taubenböck, 2020), there are no publicly available datasets with precomputed indicators that can be readily used. Having a ready-to-use dataset may save researchers substantial effort, spread urban morphology among less computationally inclined researchers, and enable quick comparative studies involving multiple cities, which are uncommon (based on our review presented in Section 3). We engage the developed method and high performance computing to calculate indicators of buildings in many cities around the world where such data are available (mostly thanks to OpenStreetMap (OSM)), and aggregate them according to zones defined by both a global administrative dataset and a global gridded population dataset at fine scale. The precomputed datasets are hosted in a public repository, as a continuous and growing effort. This part also affirms the large-scale feasibility of our work and confirms the advantages of the database approach, as the generated dataset is of unprecedented size, covering many urban areas in the world that are well mapped in OSM and containing billions of records. At the same time, the database maintains simplicity and efficiency, and makes it easy to extract data. Further, while we focus on OSM, the developed software architecture is versatile and the approach is dataset-agnostic, as different and multiple datasets can be used to derive the metrics, allowing customisation to suit one's needs. We release this dataset as open data under a liberal licence so that anyone can use it without restrictions. To the best of our knowledge, this dataset is the most comprehensive one in this domain and, together with the established methodology, we hope that it will contribute to the field in multiple ways. Our work by design enables computation at any scale and at any location, and given the worldwide focus demonstrated by the datasets and novel analyses that we showcase in this paper, we title our contribution Global Building Morphology Indicators (GBMI).
The paper is organised as follows. Section 2 expands the introduction by affirming the relevance of building indicators in urban morphology studies and discusses related work. Section 3 presents the compendium of indicators characterising building morphology, which are derived from a systematic literature review. In Section 4, we present the implementation portion of GBMI: open-source software that generates a structured and easily accessible database, and a large-scale dataset that we release as open data. Section 5 gives an insight into the generated data through several analyses that suggest urban textures and fingerprints of cities and assert the importance of the work both for studying patterns within single cities and for comparative analyses, which is a further scientific contribution. Section 6 discusses the work, revealing challenges, limitations, and opportunities, while Section 7 concludes the paper.

Examples of urban form studies relying on data of buildings
There are many examples of using building data to parametrise the urban form. Some are given in this section to provide an understanding of the applications across various disciplines.
Urban morphology has been an integral component of the work by Chen et al. (2020) on microclimate simulations. They zero in on several indicators, such as building coverage and total exterior wall area, mainly derived from a climate study showing their association to air temperature (Jin, Cui, Wong, & Ignatius, 2018). Further examples of microclimate studies relying on quantitative urban form metrics are many (Ng, Yuan, Chen, Ren, & Fung, 2011; Tong et al., 2018).
The paper of Zhu et al. (2020) is another environmental work, describing the effect of urban morphology on the solar capacity of ten cities and discussing what urban form is most desirable to effectively harness solar energy. Similarly to others, they take into account the average height of buildings in an area, but also the standard deviation of the values to indicate their variation in a study area. In the same domain, Morganti, Salvati, Coch, and Cecere (2017) determine seven indicators and investigate their association with the solar availability on façades of buildings. The analysis has been carried out at the neighbourhood level in two cities, and it reveals that the two most useful indicators for this purpose are the ratio of the built-up area to the urban site area (named as gross space index) and the ratio of the area of the walls of buildings to the urban site area (named as façade-to-site ratio). There are scores of studies in this domain, using different sets of indicators and datasets (de Lemos Martins, Adolphe, Bastos, & de Lemos Martins, 2016), suggesting a lack of consensus or the variability of the importance of different metrics across different geographies.

Related work
There has been some work on developing inventories of indicators, implementing their computation, and generating open datasets. Nonetheless, most of such work concerns street networks. For example, Boeing (2021) presents an open dataset of street network indicators with worldwide coverage, generated using free and open-source software (Boeing, 2017).
Lemoine-Rodríguez, Inostroza, and Zepp (2020) assess the spatial structure of 194 cities over 25 years, based on a set of landscape metrics, identifying clusters of similar cities and various patterns. Their dataset is released openly (Lemoine-Rodríguez, Inostroza, & Zepp, 2020). The method relies on a coarse land use dataset and it is rather focused on the city level. A related effort is the dataset by Demuzere, Bechtel, Middel, and Mills (2019). Their work is focused on classifying local climate zones (LCZ) based on indicators at the block level. Our approach regards individual buildings and it provides indicators at a fine spatial resolution and at multiple scales, and it could be used as an input to compute LCZs.
When it comes to data, the work that is perhaps the most related to ours is the one of Heris, Foks, et al. (2020). They use a nation-wide building footprints dataset of the United States and rasterise it at high resolution (30 m), computing six metrics for each cell. This dataset has been released as open data. However, the number of indicators is limited, the resolution is fixed, the work does not regard administrative units, and the dataset is focused on a single country. A related effort is presented by Li, Liu, Zhang, Xue, and Li (2021), focusing on dozens of cities in China, with similar shortcomings; moreover, the dataset does not appear to be released openly.
There has been previous exploratory work that is focused on indicators that quantitatively measure the urban form. Fleischmann, Romice, and Porta (2020) examine the state of the art of quantitative analysis of the urban form. Among other aspects of the review such as delineating applications and purposes of the quantitative studies, the paper formalises a framework of metrics to characterise the urban fabric. However, the study is rather general, as it is not focused on buildings only. For related work, see also Dibble et al. (2017). In our review, we focus primarily on understanding the inventory of indicators and the data that were used to compute them. We have encountered and devised some unlisted indicators that we implement in our work. Basaraner and Cetinkaya (2017) overview 20 indicators to characterise the shape complexity of building footprints; they investigate their usefulness and the computational complexity they entail. The study indicates that only a handful are sufficiently distinct (e.g. several pairs exhibit strong correlations), suggesting that it is of no use to have many different indicators characterising the same aspect. We learn from their work, but at the same time, we design our approach as modular, so additional indicators can be added easily if necessary.
Tool-wise, to the best of our knowledge, the work that is most akin to ours is the recent one by Jochem and Tatem (2021). They have developed an R package for calculating 2D morphology metrics from building footprints. Such development affirms the importance of deriving indicators based on building datasets and having easily available solutions to do so. Our work presents the following distinctions and contributions. First, we encompass a substantially larger suite of indicators derived from a systematic literature review. Second, our work includes the additional dimension of vertical indicators where building height data are available. While we are aware that 3D datasets are not available as widely as those in 2D, and thus, there are many areas for which such indicators cannot be computed for the time being, the share of metrics calculated from the vertical extent of the built environment in literature cannot be discounted. Third, a pillar of our work is an open dataset, which spares users from employing software and running the computations themselves. Fourth, our architecture is a structured database, which may be more suitable for handling big data and large-scale analyses, as demonstrated by later sections. As confirmed by a recent review of Fleischmann, Feliciotti, and Kerr (2021), a database tool has not yet been developed in this field, a gap we seek to bridge in this paper.

Review methodology
The first part of our triad (inventory, software, and data) is intended to shape the structure and list of indicators, starting with identifying those that are used commonly in the field. For that, we follow the typical approach of systematic literature reviews: we select relevant keywords and search for a set of papers, after which we filter papers relevant for our study, and then extract relevant information from them.
We have searched Scopus for papers that contain the following terms in the title, abstract or keywords: 'urban', 'building', and 'morphology'. The query, which was performed in early January 2021, returned 1566 results. Considering the large number of papers and that a very comprehensive literature review is not the main purpose of this paper, it is not feasible to comb through all of them. As a way to reduce the pool of papers, we decided to focus on a subset containing the latest published papers. An advantage of this approach is that the survey reflects the state of the art and the current status of the related studies. Therefore, we have selected the latest 100 published papers. In the screening phase, we have examined the title and abstract of each paper in the initial pool. We have proceeded to include a paper in our review only if it is a quantitative paper focusing on indexes derived from building data. For example, we have excluded qualitative and descriptive studies and those in which buildings merely play a minor role. Applying these inclusion criteria left us with 43 papers for a closer review. For each paper identified as relevant, we have extracted the list of building indicators, and secondary information such as the resolution at which the indicators have been computed, the type of building data, and the data source.

Summary, observations, and considerations
The literature review has been important in guiding the development of the inventory of indicators that we present in the next section. It highlighted the breadth of building form indicators, and it affirmed that urban form studies are largely disconnected, using a non-standardised set of metrics, even within the same research lines. There are several further relevant observations that we unpack in this section.
A seemingly straightforward and unambiguous task such as recording the list of building-related indicators found in publications, together with auxiliary information such as understanding the data source(s), was hampered by a few issues.
First, we have observed that a number of papers are not forthcoming in listing and explaining the metrics used in the analysis. In certain cases, for some listed indicators, it is not clear what they represent and how they are computed. Furthermore, it is not always evident what data source and type were used to derive the metrics. Such issues hinder the assessment of how feasible the implementation of some indicators is.
Second, as indicated by Fleischmann, Romice, and Porta (2020), morphological studies exhibit terminological inconsistencies, which we can attest to based on our exploration of papers. Besides multiple terms for the same concept appearing regularly in publications (e.g. metrics, indicators, parameters, factors), we notice that disciplines collide in terminology, using different terms for the same indicator.
Third, the granularity and definition of indicators may be subject to different interpretations. For example, footprint area, height, and volume of a building are three archetypal measures across multiple domains. However, it is common practice to compute the last of the three by multiplying the first two (Asadi, Arefi, & Fathipoor, 2020). This raises the question of whether the volume should be considered a distinct indicator or simply a derivative of the more fundamental footprint area and height, and thus not counted as a unique indicator. We have decided to include such derivative indicators if encountered more than once. Hence, in our work, volume is considered as an indicator, and the same goes for other derivative indicators such as the ratio of the building height to its footprint area (Hu, Dai, & Guldmann, 2020). Further examples of such derivative indicators include the area not covered by buildings (i.e. total area minus building coverage area) (Li, Wu, Lin, Li, & Du, 2020), ratio of the width and length of the footprint (Tikhonova & Beirão, 2020), façade to site ratio (i.e. sum of the product between the perimeter and the average building height in the area) (Litardo et al., 2020), surface to volume ratio (Othman & Alshboul, 2020), total area of urban envelopes divided by the corresponding flat area (Zhu et al., 2020), and absolute compactness (i.e. measuring the compactness of settlements by dividing the building volume by the zone area) (Mohabat Doost, Buffa, Brunetta, Salata, & Mutani, 2020). Accounting for all such combinations would be futile. Nonetheless, we have broken down such indicators to make sure we include the lowest common indicator used to compute them. Therefore, while we do not list them all explicitly, all these can be computed with our tool if necessary, given its modular nature and since they are usually computed from existing fundamental indicators that we include in our list.
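To illustrate how derivative indicators follow from fundamental ones, the minimal Python sketch below derives volume, the height-to-footprint ratio, and a surface-to-volume ratio from footprint area, perimeter, and height. The class and function names are hypothetical, and the geometry is an assumption made only for illustration: the building is treated as a flat-roofed prism extruded from its footprint, and the envelope is taken as walls plus roof. The reviewed studies may define these quantities differently.

```python
from dataclasses import dataclass


@dataclass
class Building:
    """Fundamental indicators of one building (hypothetical field names)."""
    footprint_area: float  # m^2
    perimeter: float       # m
    height: float          # m


def derivative_indicators(b: Building) -> dict:
    """Derive secondary indicators from fundamental ones.

    Assumes a flat-roofed prism: volume = footprint area x height.
    """
    volume = b.footprint_area * b.height
    wall_area = b.perimeter * b.height
    # Envelope taken as walls + flat roof (ground contact excluded); an assumption.
    envelope_area = wall_area + b.footprint_area
    return {
        "volume": volume,
        "wall_area": wall_area,
        "height_to_footprint_ratio": b.height / b.footprint_area,
        "surface_to_volume_ratio": envelope_area / volume,
    }


example = derivative_indicators(
    Building(footprint_area=200.0, perimeter=60.0, height=10.0)
)
```

Because every derived value is a function of the stored fundamental indicators, such quantities can be computed on demand rather than listed exhaustively.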
Moving forward to the observations, in the majority of cases (three quarters), the studies focus on aggregating indicators of buildings at the level of a zone. The zones, which, depending on the context, data, and size, are also called areas, plots, and sites, may be represented as regular grids or irregular delineations. There are different instances of irregular zones, such as administrative or census areas, land use parcels of variable size, and clusters of urban form (Li, Zhou, Gong, Seto, & Clinton, 2020; Song, Leng, Xu, Guo, & Zhao, 2020). On the other hand, gridded zones, which are also common, are by definition regular and specified by a resolution. Their sizes vary, and some researchers opt for multiple levels of aggregation in the same study, e.g. grids varying from 100 to 1000 m (Ribeiro, Martilli, Falls, Zonato, & Villalba, 2021). The aggregated indicators are based on common descriptive statistics indicating the central tendency and dispersion of a distribution, such as mean and standard deviation, and are sometimes combined with another value (Li, Schubert, Kropp, & Rybski, 2020; Liao, Hong, & Heo, 2021). For example, such indicators include the coverage of building footprints (i.e. sum of footprint areas of all buildings in a zone divided by the area of the zone, obtaining the site coverage in percentage) and the average building footprint area.
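The two zone-level examples just mentioned can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function and key names are hypothetical, footprint areas and the zone area are given in the same unit, and the population standard deviation is used for dispersion.

```python
from statistics import mean, pstdev


def aggregate_zone(footprint_areas, zone_area):
    """Aggregate building footprint areas (m^2) over one zone (m^2)."""
    if zone_area <= 0:
        raise ValueError("zone_area must be positive")
    total = sum(footprint_areas)
    has_buildings = len(footprint_areas) > 0
    return {
        "building_count": len(footprint_areas),
        # Site coverage: built-up share of the zone, in percent
        "site_coverage_pct": 100.0 * total / zone_area,
        "mean_footprint_area": mean(footprint_areas) if has_buildings else None,
        "std_footprint_area": pstdev(footprint_areas) if has_buildings else None,
    }


# Three buildings in a 1 ha zone
zone = aggregate_zone([100.0, 200.0, 300.0], zone_area=10_000.0)
```

The same function applies unchanged to a grid cell or an administrative unit, since only the zone area and the list of buildings assigned to the zone differ.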
Between the indicators derived independently for each building and the subsequent aggregated indicators derived at the level of irregular or regular zones, there are contextual indicators, i.e. zonal indicators at the building level, that are computed for each building from its surroundings, most often from circular buffers around each building (Milojevic-Dupont et al., 2020; Song, Lu, & Xing, 2020). They are computed for each building, but are dependent on the surrounding buildings, unlike independent indicators such as the complexity of a footprint of a building. An example of such an indicator derived from the neighbouring context is the mean distance from the building in focus to all other buildings in a buffer. Further, these contextual building-level indicators may be aggregated at the level of a zone, e.g. mean of the mean distances between buildings, resulting in hierarchical summary statistics and indicating the interrelated and sequential nature of metrics. Consequently, in our catalogue, we outline three groups of indicators: building-level independent, building-level buffer, and zone-level indicators. In our implementation and dataset, we keep multiscalability in mind. Thus, the building-level buffer and zone-level indicators are computed at different scales and resolutions, suiting the diverse needs of various disciplines.
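The hierarchy described here (a contextual indicator per building, then its aggregation per zone) can be sketched as follows. The sketch makes simplifying assumptions that are ours: buildings are reduced to centroid points in a projected coordinate system, the buffer is a circle around the centroid, and neighbours are found by brute force. A production implementation would more likely work on footprint geometries with a spatial index.

```python
from math import hypot
from statistics import mean


def contextual_indicators(centroids, radius):
    """For each centroid, count neighbours within `radius` (m) and compute
    the mean distance to them (None when a building has no neighbours)."""
    results = []
    for i, (xi, yi) in enumerate(centroids):
        distances = []
        for j, (xj, yj) in enumerate(centroids):
            if j == i:
                continue
            d = hypot(xj - xi, yj - yi)
            if d <= radius:
                distances.append(d)
        results.append({
            "neighbours": len(distances),
            "mean_distance": mean(distances) if distances else None,
        })
    return results


# Three nearby buildings plus one isolated building (coordinates in metres)
buildings = [(0.0, 0.0), (30.0, 0.0), (0.0, 40.0), (500.0, 500.0)]
per_building = contextual_indicators(buildings, radius=60.0)

# Second level of the hierarchy: mean of the per-building means in the zone
valid = [r["mean_distance"] for r in per_building if r["mean_distance"] is not None]
zone_mean_of_means = mean(valid)
```

Running the function with several radii yields the same indicators at multiple scales.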
Data-wise, most studies rely on a single source of data, but overall, a variety of datasets is observed, e.g. vector building footprints occasionally containing also relevant attributes (most common), point clouds, semantic 3D city models, satellite imagery, street view imagery, and digital surface models (Aktas et al., 2020; Chen, Qiu, et al., 2020; Zhu et al., 2020). In spite of the different provenance of data, many of the indicators, e.g. footprint area, are common and can be computed from different datasets.
In the studies, not all indicators are computed from geometric data, and attributes are in some instances also considered as morphological parameters (Carlucci, Zambon, & Salvati, 2019). For example, it is common to come across studies estimating the building height from the number of storeys multiplied by an assumed fixed floor height (Ku & Tsai, 2020; Li, Koks, Taubenböck, & van Vliet, 2020; Liu, Xu, Zhang, & Shu, 2020; Peng et al., 2020). Such computations imply relaxed data quality requirements, as such an approach may be prone to errors.
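The storey-based height estimate amounts to a single multiplication, sketched below. The floor height of 3.0 m is an arbitrary illustrative value, not one taken from the cited studies, and the function name is hypothetical.

```python
ASSUMED_FLOOR_HEIGHT_M = 3.0  # illustrative assumption; studies use different values


def estimate_height(storeys, floor_height=ASSUMED_FLOOR_HEIGHT_M):
    """Estimate building height (m) as storeys x assumed floor height.

    Returns None when the storey count is missing or not positive.
    """
    if storeys is None or storeys <= 0:
        return None
    return storeys * floor_height


# The estimate scales linearly with the assumed floor height,
# so any error in that assumption grows with the number of storeys.
low = estimate_height(20, floor_height=2.8)
high = estimate_height(20, floor_height=3.5)
spread = high - low
```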
Most of the studies focus on a neighbourhood or a city, but there are exceptions, e.g. comparative studies including dozens of cities (Kraff et al., 2020; Liu, Wang, Qiang, Wu, & Wang, 2020).
The median number of building indicators used in the included papers is 4, with nearly all studies having a single-digit number of indicators, barring a few instances, e.g. the one by Milojevic-Dupont et al. (2020) including approx. 80 metrics thanks to different scales (e.g. multiple buffers), derivative indicators, and summary statistics.
This exploration was instrumental in developing the catalogue of indicators presented next (Section 3.3). Still, we have not moved forward with all the indicators we have identified. First, some indicators are highly localised and categorical. For example, some papers categorise buildings into a few height classes (Cao et al., 2020; Ribeiro et al., 2021; Xia & Li, 2021), and count the share of these classes in an area (e.g. percentage of buildings that are taller than 20 m). These indicators are variations of existing indicators (e.g. building height) and they can be readily computed from the data that our tool generates and the dataset that we have released, if necessary. In fact, such an approach would also give researchers greater flexibility to define their own classes and thresholds, as they are not universal across studies.
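As a sketch of how such a class-based indicator could be derived after the fact from per-building heights, the following Python function computes the share of buildings above a user-chosen threshold. The function name and the handling of missing heights (ignored) are our assumptions.

```python
def share_above(heights, threshold):
    """Percentage of buildings with a known height strictly above `threshold`.

    Buildings without a height value (None) are ignored; returns None when
    no building in the area has a known height.
    """
    known = [h for h in heights if h is not None]
    if not known:
        return None
    return 100.0 * sum(h > threshold for h in known) / len(known)


heights = [6.0, 12.0, 25.0, 40.0, None]   # metres; one building lacks height data
share_tall = share_above(heights, 20.0)   # threshold from the example in the text
share_mid = share_above(heights, 10.0)    # a different, user-defined threshold
```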
Second, there are a few indicators that we have not included as we focus on those that can be computed from widely available datasets (i.e. building footprints) and at a large scale, not requiring additional and/or scarce datasets, and not using specialised software. These are mostly intricate indicators in the microclimate domain (Sadeghi, Wood, Samali, & de Dear, 2020; Yuan et al., 2019), such as frontal area density, that is, "the area of building surface that approach the dominant wind direction in the area" (Ma & Chen, 2020). Another example is the sky view factor, which is influenced by buildings but also by other urban features, thus requiring also vegetation and other data to compute it (Palme, Privitera, & Rosa, 2020; Xia, Yabuki, & Fukuda, 2021). Further, such indicators are complex to compute (Yuan & Chen, 2011), so scaling them would be computationally prohibitive, and some studies suggest that they may not always add much in comparison to other building form indicators (Gao, Zhan, Yang, & Liu, 2020).

Inventory of indicators of the urban form pertaining to buildings
We have structured the metrics into three groups: those computed for each building from its own characteristics, those computed for each building from its surroundings (the two are presented together in Section 3.3.1), and those calculated for a spatial unit such as a plot by aggregating building-level indicators of the buildings in that area (Section 3.3.2).
In spite of the clear differentiation between these building-level and aggregated indicators, they are intertwined. Computing some building-level indicators first requires computing a set of building-level indicators, aggregating them at a higher level, and then relating the result back to the building level (e.g. the rank of the size of each building in comparison to other buildings in the area).

Building-level indicators
Metrics computed at the building level are the essence of virtually all morphometric studies. However, only a minority of the studies we have reviewed focus on such indicators, while most concentrate on their aggregations at a higher-level zone. Still, these indicators are essential in computing the aggregated counterparts, thus they should be given proper attention.
The building-level indicators are listed in Table 1. Some of the indicators are illustrated in Fig. 1, while Fig. 2 extends the exemplification by depicting building-level indicators that have been derived from the buffer encircled around the building.
We identify 17 indicators that are computed from the building characteristics alone, in which the surrounding context plays no role; and 26 indicators that are computed from the buildings surrounding it. All these indicators may be used to calculate derivative indicators encountered in the literature review. Further, for each indicator, it is possible to compute the rank as the percentile of the metric in the zone in which the building is located. An example is the percentile of the footprint area with respect to a particular zone (in our implementation, we include multiple such values since we regard multiple zones). This metric is also an example of how different levels are linked, as it cannot be computed solely from individual buildings, requiring aggregation at the zone.
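The rank of an indicator within a zone can be sketched as below. The percentile definition used here (share of buildings in the same zone whose value is less than or equal to the building's value) is one common convention and an assumption on our part; the input layout and function name are likewise hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def percentile_ranks(buildings):
    """Percentile rank (0-100) of each building's value within its zone.

    `buildings` is a list of (building_id, zone_id, value) tuples.
    """
    # First pass: gather and sort the values of each zone (the aggregation step)
    by_zone = defaultdict(list)
    for _, zone_id, value in buildings:
        by_zone[zone_id].append(value)
    for values in by_zone.values():
        values.sort()

    # Second pass: relate each building back to the distribution of its zone
    ranks = {}
    for building_id, zone_id, value in buildings:
        values = by_zone[zone_id]
        ranks[building_id] = 100.0 * bisect_right(values, value) / len(values)
    return ranks


footprints = [
    ("a", "z1", 100.0), ("b", "z1", 200.0), ("c", "z1", 400.0),
    ("d", "z1", 800.0), ("e", "z2", 50.0),
]
ranks = percentile_ranks(footprints)
```

The two passes mirror the dependency noted in the text: the rank cannot be obtained from a building in isolation, only after the values of its zone have been collected.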
The indicators are mostly self-explanatory, with a couple requiring additional explanation. There are three indicators that are computed from the minimum bounding rectangle (MBR) of the building footprint: length, width, and area. Further, there are four indicators that parametrise the shape of the footprint: shape complexity, shape compactness, equivalent rectangular index, and number of vertices (Angel, Parent, & Civco, 2010; Basaraner & Cetinkaya, 2017).
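A rough Python sketch of some of these footprint measures follows. Two simplifications are ours and should not be read as the definitions used in GBMI: compactness is computed with the isoperimetric quotient 4πA/P² (one of several compactness measures in the literature), and the bounding rectangle is axis-aligned rather than the rotated minimum bounding rectangle.

```python
from math import hypot, pi


def footprint_indicators(ring):
    """Basic shape measures of a simple polygon footprint.

    `ring` is a list of (x, y) vertices in metres, without repeating the
    first vertex at the end.
    """
    n = len(ring)
    if n < 3:
        raise ValueError("a footprint needs at least three vertices")
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1       # shoelace formula term
        perimeter += hypot(x2 - x1, y2 - y1)  # edge length
    area = abs(twice_area) / 2.0
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    dx, dy = max(xs) - min(xs), max(ys) - min(ys)
    return {
        "area": area,
        "perimeter": perimeter,
        "vertices": n,
        # Isoperimetric quotient: 1 for a circle, smaller for elongated shapes
        "compactness": 4.0 * pi * area / perimeter ** 2,
        # Axis-aligned bounding rectangle (simplification of the MBR)
        "bbox_length": max(dx, dy),
        "bbox_width": min(dx, dy),
        "bbox_area": dx * dy,
    }


# L-shaped footprint: a 20 m x 20 m square with a 10 m x 10 m corner removed
l_shape = [(0, 0), (20, 0), (20, 10), (10, 10), (10, 20), (0, 20)]
shape = footprint_indicators(l_shape)
```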
The contextual metrics, calculated for each building from the buildings that are in the buffer around it, are: number of neighbours, distances, area of footprints, and ratio of neighbour height to distance. Each of the last three is a list of values, which is summarised using eight statistical measures: minimum, median, mean, maximum, sum, standard deviation, index of dispersion (D; also known as variance-to-mean ratio), and coefficient of variation (CV; also known as relative standard deviation).
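The eight summary measures can be sketched with Python's standard `statistics` module. The use of the population variance (rather than the sample variance) is an assumption for illustration, as is the function name.

```python
from statistics import mean, median, pvariance


def summarise(values):
    """Summarise a list of values with the eight measures named in the text."""
    if not values:
        raise ValueError("cannot summarise an empty list")
    mu = mean(values)
    var = pvariance(values, mu)  # population variance (assumption)
    sd = var ** 0.5
    return {
        "min": min(values),
        "median": median(values),
        "mean": mu,
        "max": max(values),
        "sum": sum(values),
        "std": sd,
        # D = variance / mean; undefined when the mean is zero
        "index_of_dispersion": var / mu if mu else None,
        # CV = standard deviation / mean; undefined when the mean is zero
        "coefficient_of_variation": sd / mu if mu else None,
    }


# Distances (m) from one building to its neighbours within a buffer
distance_summary = summarise([10.0, 20.0, 30.0, 40.0])
```

Note that D carries the unit of the measured quantity while CV is dimensionless, which matters when comparing dispersion across indicators with different units.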
In total, there are 43 indicators at the building level, which are doubled to 86 if considering their ranks in the corresponding zone. However, depending on how they are implemented, this number may multiply further. For example, in our implementation (Section 4), we calculate buffers of three fixed sizes (25, 50, and 100 m), increasing the number of metrics. Further, in our calculations, we use multiple zones (e.g. grid, district, city, province); thus, the ranks of each metric are computed for each hierarchical zone, multiplying the number of indicators. This example suggests why it might be difficult to create a list of indicators, and how the addition of an indicator traverses multiple levels, compounding complexity despite the seemingly straightforward nature of building indicators.

Aggregated indicators
The urban form is most often studied in an aggregated manner, at irregular or regular zones such as a plot, administrative unit, and grid cell (tile). The aggregated measures stem from the building-level metrics (Section 3.3.1) of all buildings in the zone and are summarised with descriptive statistics, which we have listed in the previous section. Such indicators are simply permutations of descriptive statistics and the introduced metrics, e.g. mean shape complexity of all buildings in the zone. This outlook supports many studies, including those that focus on understanding the entropy of an indicator, e.g. variation of heights of buildings in a study area, which is important for climate change mitigation (Adelia, Yuan, Liu, & Shan, 2019; Usui, 2020). Such variation may be measured by one of the calculated dispersion metrics (e.g. index of dispersion) or by understanding their range (i.e. maximum minus minimum). Further, we have encountered studies that use the minimum and maximum value of an indicator in a zone as an aggregated measure (Cao, Luan, Liu, & Wang, 2021), which we support as well. Permuting the combinations should allow computing most, if not all, such derivative indicators that are calculated as an integration of two or more metrics.
We have noticed that while summary statistics of certain indicators are common, their permutations (all summary statistics for all indicators) are not all covered in literature. Therefore, contributing to the field, our approach expands the identified indicators, potentially revealing the usability of new indicators, which is advantageous for supporting derivative indicators that we have encountered in literature, e.g. a combination of the mean building area with the building height in an area (Cao et al., 2020). In our implementation, we have excluded computing summary statistics where it is not meaningful to do so (e.g. sum of number of floors of all buildings). Thus, in Table 2 we denote which of the indicators we have included in our implementation. Some of them are illustrated in Fig. 3.
In theory, the catalogue has 177 indicators at the zone level. When ranks are considered, the list doubles to 354. Further, note that each of the contextual indicators that is summarised in eight ways (see Table 1) may be expanded to have these statistics aggregated further according to the same descriptive statistics (e.g. coefficient of variation of the mean distance to buffered neighbours in a zone), and with ranks added, that results in 690 indicators at this level.
In practice, it is not sensible to compute them all, especially when having multiple levels of aggregation as in our implementation. Therefore, in the tool, we do not include the 'double nested' summary indicators, and consider only the mean values when computing the ranks.
We believe that the presented framework and list of urban form indicators is the most exhaustive one presented in literature hitherto, to the extent of our knowledge, and with the addition of new ones, presents a contribution and novelty in the field.

Overview
Our implementation includes the development of software to compute urban form indicators using building footprint data and zone data, and a large-scale high-resolution dataset that was generated using the tool. Both are described together in this section to ease the understanding of the process.
In our approach, we decide to rely on a database management system for all the steps (ingesting and processing data, computation of indicators, and their storage), which has not been investigated before. We do that for multiple reasons: efficiency, organisation, robustness, modularity, scalability, and extraction. A database also allows upgrading with new datasets, and easy querying and extraction of only the data that are required, meaning that the generated datasets can be used in a wide range of statistical and geospatial software, not being confined to one software tool or programming language. Further, this approach means that our solution is cross-platform.
The tool that we develop is a sequential set of Python and SQL scripts that are used to set the scene and process input data (building footprints and defined zones) to build a PostgreSQL database spatially enabled with PostGIS. The Python scripts allow a high degree of customisation as a number of indicators can be toggled on or off. Thereafter, a series of hundreds of statements is run to compute the indicators and store them in a structured hierarchy of tables. Further, export scripts are prepared as well to allow exporting data. Having a database approach does not exclude user-friendly desktop software: these scripts enable export of the database in many different geospatial formats that can be plugged in directly in nearly all GIS software and programming environments, leveraging the best of both worlds. The architecture of GBMI is illustrated in Fig. 4.
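To illustrate how Python scripts may toggle indicators and emit SQL for a PostGIS database, consider the following minimal sketch. The table names (`buildings`, `building_indicators`), the column names (`id`, `geom`), and the toggle structure are hypothetical and are not taken from the released code; `ST_Area`, `ST_Perimeter`, and `ST_NPoints` are standard PostGIS functions. The sketch only builds the statement text and does not connect to a database.

```python
# Hypothetical toggles: indicator name -> (PostGIS expression, enabled?).
# Assumes `geom` is stored in a metric projected CRS so that areas and
# perimeters are expressed in metres.
INDICATORS = {
    "footprint_area": ("ST_Area(geom)", True),
    "perimeter": ("ST_Perimeter(geom)", True),
    "vertex_count": ("ST_NPoints(geom)", False),
}


def build_statement(table="buildings", target="building_indicators"):
    """Return a CREATE TABLE ... AS SELECT statement for enabled indicators."""
    columns = [
        f"{expression} AS {name}"
        for name, (expression, enabled) in INDICATORS.items()
        if enabled
    ]
    select_list = ",\n    ".join(["id"] + columns)
    return f"CREATE TABLE {target} AS\nSELECT\n    {select_list}\nFROM {table};"


sql = build_statement()
print(sql)
```

Generating statements from a single dictionary keeps the list of indicators in one place, so disabling an indicator removes it from every derived table.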

Table 1
List of indicators at the building level (both the independent and contextual instances, together with their ranks at the corresponding zone).
- Same 8 descriptive statistics as above.
a MBR: Minimum Bounding Rectangle.
b The size of the buffer varies in literature. In our implementation, we create three buffers using the following values: 25, 50, and 100 m.
c The last contextual indicator measures the ratio between the average height of buildings in a buffer and the distance among them.
F. Biljecki and Y.S. Chow

Our implementation takes primarily OpenStreetMap as the source of building data, and two particular multi-scale zonal datasets (Section 4.2). However, the tool can be used with other types of building and zone data (e.g. if researchers have a more suitable building and/or zonal dataset, they may use our method with their data).
Both the dataset and the code used to generate it have been released as open data and open-sourced, under the Creative Commons license. As much as possible, we have designed our work to adhere to the principles laid out by the growing initiatives for openness and reproducibility in GIScience (Nüst et al., 2018; Wilson et al., 2021).

Zones
The tool supports both irregular (e.g. plots, vectors) and regular (e.g. grids, rasters) datasets for zones. Either of the two is required, but we have computed both for the sake of completeness and demonstration of both variants of zones. An advantage of using an administrative dataset is that we can enrich a building dataset with additional information from the dataset, such as the name of the district in which it is located, facilitating querying (e.g. when one needs to examine the data of a particular place) and linking to other data enabling studies across multiple disciplines.

Administrative boundaries (GADM).
The global administrative boundaries are obtained from the Database of Global Administrative Areas (GADM; see footnote 1), an open dataset that provides hierarchical administrative divisions starting from level 0, the national level, with up to 6 levels, e.g. districts. We have loaded all the zones at all levels, constructing a multi-scale database. A hint of these boundaries is given in Section 5, in which we provide maps of indicators aggregated at administrative levels.
Grid (WorldPop).
For the grid, we have used a multi-scale global raster released by WorldPop (Lloyd et al., 2019; Lloyd, Sorichetta, & Tatem, 2017; Tatem, 2017; WorldPop, 2018), at a resolution of arcseconds (approx. 1 km at the equator). The raster contains million cells, with an average size of 0.6 km², with a population estimate for each one, based on dasymetric redistribution (Stevens, Gaughan, Linard, & Tatem, 2015). We also include a finer variant of the raster, at 100 m resolution, providing flexibility to users. Together with the levels from the administrative dataset, we have 8 zones, thus, we have layers of aggregated metrics (Section 3.3.2). While we could have created our own grid, we decided to use an existing one by an established project such as WorldPop, easing linking to their data and further analyses. As the tool is flexible when it comes to the input vector data for zones, it allows other input data such as smaller plots and morphological cells when more appropriate for certain studies (Fleischmann, 2019; Fleischmann, Feliciotti, Romice, & Porta, 2020).

Computing the indicators
The indicators are computed with SQL statements and dozens of tables are generated for each hierarchy. Beyond the indicators that we have listed in Section 3.3 (barring the denoted exceptions), and although semantic building data are rarely discussed in literature, our tool retains attribute data should a need to use them arise. For example, a dataset on the age of each building is used by Li, Koks, et al. (2020) to calculate the degree of mixing of building ages.
c H/D indicates the height-to-distance ratio.
1 http://gadm.org/
By implementing three sizes of buffers (25, 50, and 100 m), triplicate metrics of the same indicator have been computed (e.g. the mean distance between buildings in a buffer is available thrice), supporting studies with variable buffer zones (Milojevic-Dupont et al., 2020; Zhang, Cui, & Song, 2020). All these parameters are flexible and easy to modify.
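The computation of contextual indicators for the three buffer sizes can be sketched as follows. This is a simplified illustration under stated assumptions: buildings are represented by hypothetical centroids in a metric projected coordinate system, and distances are measured between centroids rather than between footprint geometries, which a spatial database would handle with proper buffer and distance operations.

```python
import math

BUFFERS_M = (25, 50, 100)  # buffer sizes stated in the text

# Hypothetical building centroids (x, y in metres, projected CRS assumed).
centroids = {
    "a": (0.0, 0.0),
    "b": (20.0, 0.0),
    "c": (0.0, 40.0),
    "d": (90.0, 0.0),
}


def contextual_indicators(building_id, points, radius):
    """Summarise the neighbours of one building within a given radius."""
    x0, y0 = points[building_id]
    distances = [
        math.hypot(x - x0, y - y0)
        for other, (x, y) in points.items()
        if other != building_id
    ]
    neighbours = [d for d in distances if d <= radius]
    return {
        "neighbour_count": len(neighbours),
        "mean_distance": sum(neighbours) / len(neighbours) if neighbours else None,
        "min_distance": min(neighbours) if neighbours else None,
    }


# One set of contextual metrics per buffer size, i.e. triplicate metrics.
per_buffer = {r: contextual_indicators("a", centroids, r) for r in BUFFERS_M}
```

Because the buffer sizes are held in a single tuple, adding or changing a buffer requires no change to the computation itself.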

Exporting the data and repository
The computed indicators may be straightforwardly queried in the database. The indicators can be exported at different levels and in multiple formats: CSV, shapefile, GeoPackage, and GeoTIFF. The documentation contains examples of several scripts for querying and exporting data into different data formats, both tabular and geospatial.
To facilitate data distribution and release a ready-to-use dataset, we have computed a dataset for a number of cities, and released it as open data in a repository. The database covers cities and countries which we attested to being well mapped in OSM, and the list is continuously growing with the addition of new locations, as the repository is intended to be a continuous development. In Section 5, we give examples of a subset of the repository.

Examples of data and analyses
This section provides an insight into the implementation described in Section 4 by showcasing the data for more than a dozen cities around the world with diverse cultural, geographical, and morphological signatures, and we also include a few countries to affirm the large-scale nature of the work. These examples also serve as an insight into their interlinking, as in them, we demonstrate breaking down the nationwide data at multiple administrative levels further down in the hierarchy (regions, cities, neighbourhoods), and some advantages of managing the data in a database, e.g. querying to extract only the data we need for use in other software.
With these analyses, we seek to investigate the following research questions: what is the spatial distribution of the urban form in a selected set of cities? How does it vary among urban areas within the same country and around the world? How are the indicators associated among themselves? Can we link the computed metrics with various socioeconomic variables and what are their relationships?
Fig. 4. Flow of the system. The datasets in the square brackets are those that we have used for generating the dataset, while the software is agnostic, supporting other formats and datasets as well. All the computations are performed and kept in the database, which allows easy query and extraction of data. Our export scripts support multiple geospatial formats.

Starting from the basics, Fig. 5 presents the computed data from London across multiple levels of indicators. First, the compactness of building footprints in the study area is visualised. This metric is an example of an indicator that is computed for each building in isolation. In contrast, in the second map, we visualise an indicator that requires information on other buildings, i.e. the surrounding context: the number of neighbours in a 100 m buffer. Such indicators are thereafter aggregated at corresponding grid (cell) and administrative areas together with all other buildings in the same zone. The two maps on the right are examples of these: site coverage (share of area of a zone covered by building footprints), both in a gridded aggregation (100 m cells) and in administrative zones.
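Site coverage, as defined above, is the share of the area of a zone covered by building footprints. A minimal sketch of this ratio follows; it assumes that the footprints lie entirely within the zone and do not overlap, and the sample areas are hypothetical. How buildings straddling zone boundaries are treated is not specified here.

```python
def site_coverage(footprint_areas_m2, zone_area_m2):
    """Share of the zone area covered by building footprints (0 to 1)."""
    if zone_area_m2 <= 0:
        raise ValueError("zone area must be positive")
    return sum(footprint_areas_m2) / zone_area_m2


# A 100 m x 100 m grid cell with three hypothetical footprints.
cell_area = 100.0 * 100.0
coverage = site_coverage([1200.0, 800.0, 500.0], cell_area)
```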
Moving on to cross-city analyses, in Table 3, we present a few summary statistics for a selected set of cities. Note that two indicators (the 10th and 90th percentile of the footprint area) are not a standard part of GBMI. They double as an example of the flexibility of our work, enabling easy computation of new indicators if necessary.
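As an illustration of how such additional indicators may be derived from building-level values, the sketch below computes the 10th and 90th percentile of a list of footprint areas with Python's standard library. The sample data and the choice of the inclusive interpolation method are assumptions; other percentile definitions give slightly different values.

```python
import statistics


def percentile(values, p):
    """p-th percentile (1 to 99), using the 'inclusive' quantile method."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[p - 1]


# Hypothetical footprint areas in square metres: 10, 20, ..., 110.
areas = [float(a) for a in range(10, 120, 10)]
p10 = percentile(areas, 10)
p90 = percentile(areas, 90)
```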
Another advantage of the database approach is querying: we can find particular examples of zones or buildings we are interested in based on the series of metrics that are computed, e.g. a building in a city that is around both the median footprint area and median complexity (i.e. an average building in a city), or finding extreme cases of buildings and zones.
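The search for an 'average' building can be sketched as follows. The records, the field names, and the scoring rule (sum of relative deviations from the two medians) are hypothetical choices made for illustration; in a database, the same idea would be expressed as a query ordered by such a score.

```python
import statistics

# Hypothetical building records with two computed indicators.
buildings = [
    {"id": 1, "area": 50.0, "complexity": 1.0},
    {"id": 2, "area": 100.0, "complexity": 2.0},
    {"id": 3, "area": 150.0, "complexity": 3.0},
    {"id": 4, "area": 400.0, "complexity": 2.1},
    {"id": 5, "area": 110.0, "complexity": 9.0},
]


def typical_building(rows, keys=("area", "complexity")):
    """Return the record closest to the median of every listed indicator."""
    medians = {k: statistics.median(r[k] for r in rows) for k in keys}

    def score(row):
        # Sum of relative deviations; assumes the medians are non-zero.
        return sum(abs(row[k] - medians[k]) / medians[k] for k in keys)

    return min(rows, key=score)


typical = typical_building(buildings)
```

Replacing `min` with `max` in the last step of the function would return an extreme case instead.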
In Fig. 6, we visualise the distribution of site coverage for cities, revealing characteristic urban fingerprints and asserting the advantage of our structured and hierarchical approach for conducting comparative analyses.
Fig. 7 presents a nation-wide dataset and the hierarchical structure of the work, with a visualisation of an indicator, exposing differences in the urban form at a national scale. The administrative dataset matches the subdivisions of Switzerland: cantons (level 1), districts (level 2), and sub-districts (level 3).
Urban form indicators and other proxies are often used for studying economic vibrancy and establishing the relationship with demographic, real estate, political, and other characteristics of urban areas (Botta & Gutiérrez-Roig, 2021; Chen, Zhang, & Zheng, 2021; Gebru et al., 2017; Kim, 2020; Lindenthal, 2017; Xia, Yeh, & Zhang, 2020). The administrative records in our dataset allow associating it with a variety of datasets that are available at different levels of administrative divisions to support such research. Associating the same nation-wide dataset with corresponding socio-economic indicators obtained from the federal government reveals their associations (Fig. 8). For example, the percentage of employment in an area appears to be moderately correlated with multiple disparate indicators.
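A correlation between an aggregated morphological indicator and a socio-economic variable can be computed as in the following sketch. The choice of the Pearson coefficient is an assumption, as the type of correlation is not stated above, and the per-municipality values are invented purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical values for five municipalities.
mean_footprint_area = [110.0, 140.0, 95.0, 180.0, 160.0]
employment_share = [0.52, 0.61, 0.48, 0.66, 0.70]
r = pearson(mean_footprint_area, employment_share)
```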
Next, we show another hint at comparative analyses among cities (Fig. 9). We extract several metrics of three cities using another nationwide dataset (New Zealand), by querying the database, and visualise them in a radar chart. The cities are compared across several metrics that are computed at cells within each city. The values have been normalised to enable comparison, thanks to which we can observe similarities and differences among cities.
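Normalisation of this kind can be sketched with a simple min-max rescaling per indicator. The normalisation method is an assumption, since it is not specified above, and the city labels and values are placeholders rather than results from the dataset.

```python
def min_max_normalise(values):
    """Rescale values linearly to the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical city-level means of two indicators.
cities = ["City A", "City B", "City C"]
raw = {
    "mbr_width_mean": [10.0, 15.0, 20.0],
    "azimuth_mean": [40.0, 90.0, 65.0],
}

# One normalised profile per city, ready to be drawn as a radar chart.
normalised = {name: min_max_normalise(vals) for name, vals in raw.items()}
profiles = {
    city: {name: normalised[name][i] for name in raw}
    for i, city in enumerate(cities)
}
```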
Fig. 10 gives an insight into a contextual building-level indicator aggregated to the grid: the distance of a building to its nearest neighbour in its 25 m buffer, with the indicator aggregated at 100 m cells in Singapore. The map reveals distinct patterns of the urban form, which are corroborated by a cursory glance at the texture of the city-state: the western part (Clementi, top left example) is a typical high-rise setting in which buildings are well separated from each other. On the other hand, Katong (at the bottom right) is known for its densely packed shophouses, terraced houses, and detached properties. Such indicators are crucial in microclimate simulations.
Shifting our attention to shapes, Fig. 11 visualises the distribution of complexities of buildings by city. It reveals distinct patterns of building morphology across the cities, which is interesting to associate with local cultural and architectural characteristics. Ulaanbaatar stands out, likely due to its large number of gers, circular dwellings used by nomads in Mongolia. The two Dutch cities (Delft and Amsterdam) are next to each other in the plot, suggesting that a similar architecture predominates nationally. Further, these two cities exhibit a bimodal distribution, possibly due to the mix of new and old architecture, i.e. the famous canal houses (grachtenpand), whose architecture has been shaped by taxation, presenting interesting examples of how the effect of regulations may be reflected in the quantitative urban form indicators that we compute.
So far, we have presented examples of single indicators. Studying the multiplex relationships among indicators is common (Basaraner & Cetinkaya, 2017; Gisbert, Mart, & Gielen, 2017; Schwarz, 2010). As an example, we focus on three sets of association between various indicators.
First, in Fig. 12 we present scatter plots revealing the association of footprint complexity and the density of buildings in several cities, derived at the 100 m level. The plots suggest a common property of the urban form: the more buildings there are in a zone, the less complex they are.
Second, Fig. 13 indicates the correlation of dispersion metrics of the same indicator.

Table 3
Summary statistics per city: footprint area (10th percentile, mean, 90th percentile; in square metres), mean compactness, mean length (metres), and standard deviation of complexity. Two new indicators have been computed additionally to demonstrate the extensibility of the work.

Third, plotting the measures of dispersion of two different indicators together enables us to understand their relation. Fig. 14 provides the relationship of the variation of the size of a building (index of dispersion of footprint areas) and their shape (index of dispersion of footprint complexities) in a 100 m cell. The values are aggregated at the city level, and help in understanding the diversity of buildings across multiple cities, from detecting cookie cutter neighbourhoods or cities to identifying those with a highly variable architecture.
To give an impression of the scale of the data presented in this section, we note that 110 tables are produced in the database and the exported files of the exemplified administrative regions take 223 GB in size. To generate the examples, we have used R and QGIS, making the entire pipeline of GBMI and this paper entirely free and open-source (PostgreSQL/PostGIS, OpenStreetMap and zonal data, and the mentioned statistical and geospatial software).

Fig. 7. Hierarchical and structured integration of data. These plots and maps were derived from footprint areas of all buildings in Switzerland, and aggregated at multiple levels. Note that for space constraints the plot does not indicate all levels, and in the lowest level we omitted some zones.

Limitations
Our work is mainly limited by shortcomings of input data, which are out of our control. In our implementation, we focus on OSM, as it is a centralised repository of building data from around the world, with some locations mapped better than others. While in many places around the world, especially urbanised areas, building footprints have reached full completeness, in many areas heterogeneous completeness has been noticed by researchers (Varentsov, Samsonov, & Demuzere, 2020). For some of the indicators, imperfect building completeness may not be a significant issue. Provided that the sample is sufficiently representative, indicators that measure central tendency and dispersion, such as mean area and its standard deviation, may be accurate even if full completeness is not reached. Further, the rapid growth of OSM, including in less developed countries (So & Duarte, 2020; Yuan et al., 2018), assures the addition of more cities in the repository. Also, recent datasets openly released by companies, such as Microsoft (United States) and Google (Africa) (Heris, Foks, et al., 2020; Sirko et al., 2021), and the growing volume of data released openly by governments (Biljecki, Chew, Milojevic-Dupont, & Creutzig, 2021), provide confidence that much of the globe could be processed in the future and included in our repository. The completeness of attributes is another issue that will have a direct effect on the quality of the generated data, but there is some optimism as an increasing number of locations around the world, including in developing countries, is being covered with high completeness of semantic information (Biljecki, 2020).
Building height datasets are a bottleneck in many areas around the world, diminishing the scalability of indicators that require the building height. Data on building heights that are fully complete are in some cases available from authoritative (government) datasets in the form of building footprints enriched with attribute information on heights or as point clouds obtained from airborne lidar, but these are limited to few geographic areas. Despite commendable advancements in large-scale mapping of buildings using satellite remote sensing techniques, there are still no global open datasets on heights of individual buildings, and many instances are generated at a coarse spatial resolution (e.g. average building height at the scale of a block), limited in coverage, and/or their positional accuracy may not be fully adequate for studying the urban form at high resolution (Chen, Zhang, Wong, & Ignatius, 2020; Esch et al., 2022; Frantz et al., 2021; Geis et al., 2019; Li et al., 2020; Li, Herfort, Huang, Zia, & Zipf, 2020; Tian, Tsendbazar, van Leeuwen, Fensholt, & Herold, 2022; Zhu et al., 2022). This limitation solely pertains to our input dataset (OSM) and geographies with completeness issues. Recent open data developments such as the work of Dukai et al. (2020); Dukai, Peters, Vitalis, van Liempt, and Stoter (2021); Peters, Dukai, Vitalis, van Liempt, and Stoter (2022); and Yang and Zhao (2022), which offer nation-wide open building datasets including heights of buildings, give some assurance. As our work is designed to be data-agnostic, supporting other input geospatial datasets such as high quality instances released by governments and researchers, it remains an issue only in the generated dataset rather than the developed methodology, and using such datasets will alleviate this issue.

Directions for future work
While datasets such as point clouds and higher LoD 3D models give the means to additional indicators (Bonczak & Kontokosta, 2019; Chen, Qiu, et al., 2020), in the context of our work, it is unlikely that highly detailed 3D data will become available widely any time soon. We focus on the form of data that is available at a large scale, and we believe that 2D building footprints with attributes will continue to reign for a long time in such studies.
In our case, we have used the particular grid as it is based on a dataset containing population data, potentially being useful for further analyses of the relationships with socio-economic variables (an example of such analysis is illustrated in Fig. 8). Notwithstanding, our tool allows using any other grid as input, so the database could be further amalgamated with other types of data from other sources that are used in urban form studies, e.g. energy consumption (Chen, Wu, & Biljecki, 2021), air temperature (Xu, Chen, Zhou, Wu, & Liu, 2020), land cover and land use (Byahut, Patel, & Mehta, 2020; Hertwig et al., 2020), wind (Allen-Dumas et al., 2020), and liveability and sustainability measures (Benita, Kalashnikov, & Tunçer, 2021; Patias, Rowe, Cavazzi, & Arribas-Bel, 2021).
In future work, it might be beneficial to recompute the indicators on another dataset of spatial boundaries and release such data openly as well, making use of other freely available large scale gridded population datasets (Palacios-Lopez et al., 2019) and aligning it to complementary work in urban morphology such as a recently released global open dataset on street network indicators (Boeing, 2021). Connecting our datasets with others may support an even wider range of studies that rely on indicators from multiple topographic features (Gamero-Salinas, Kishnani, Monge-Barrio, López-Fidalgo, & Sánchez-Ostiz, 2020; Lindberg & Grimmond, 2011; Song, Zhang, & Han, 2021). Further, recent developments in urban morphology employ deep learning and street view imagery (Biljecki & Ito, 2021; Chen, Zhang, & Zheng, 2021; Wurm et al., 2021). A viable direction for future work would be to bring them together and complement our approach.
Formalised data integration is another potential future direction of this project. As building datasets are disseminated in formats that allow storing attribute content in a standardised manner to facilitate data exchange and interoperability, we will investigate integrating the metrics in the building dataset using a more formal approach. For example, CityGML, a prominent standard for managing and exchanging 3D building data (Kutzner, Chaturvedi, & Kolbe, 2020), allows extending it with schemas to store additional attributes (Biljecki et al., 2021), which may be applicable to the metrics we computed.
Finally, data on the urban form have been involved in temporal studies to study its evolution, with those derived with remote sensing techniques dominating such papers (Colucci, Ruvo, Lingua, Matrone, & Rizzo, 2020; Zhao, Weng, & Hersperger, 2020). Our approach enables temporal studies, as data from different periods may be processed separately and compared in a consistent manner. However, in future work, it might be beneficial to enable storing such data concurrently and provide better support for it.

Conclusions
Spatial data has for a long time been instrumental in characterising urban areas with regard to land use, street networks, and buildings, but not without research gaps and limitations. This work, focusing on building data and accompanying their unprecedented growth, presents a triplex contribution in understanding the structure of cities.
First, we compiled a thorough overview of metrics to quantify the urban form, based on an exploration of recent work, and supplemented it with a new array of indicators, forming a structured approach. We believe that this is the most comprehensive list of building-related morphological indicators hitherto and that, given the stringent nature of the review, our paper also doubles as a 'mini review paper'; we have also shared some observations encountered during the review. We have derived a comprehensive list of hundreds of indicators at multiple scales. While we do not claim that our work serves as an authoritative and conclusive list of metrics of building morphology, and while we do not attempt to impose a consensus on them, we hope that the framework will contribute towards their standardisation and a definitive catalogue. Considering the breadth of indicators we include in GBMI, and that our literature review reveals that nearly all studies use fewer than 10 indicators, we believe that it may inspire researchers to experiment with metrics they would not have taken into account otherwise, potentially furthering applications and revealing new insights. Next, we believe that a combination of multiple metrics may introduce new applications in studying the urban form. Furthermore, fellow researchers are welcome to add new indicators in our modular and extensible open-source pipeline. We also believe that our inventory will serve researchers working on other types of urban data such as point clouds, from which many of the same set of indicators can be extracted.
Second, a key contribution is the implementation of a free and open-source solution, which is database-based, a novelty in the field. It supports the computation of the indicators at a very large scale and allows using different input datasets and varying parameters, serving a multitude of studies across a range of disciplines. Statistical computations can be done directly in the database, or the data can be exported to a set of files and used in statistical and geospatial software (Section 5).
Third, we generated detailed ready-to-use datasets and released them openly, providing a convenient instance that others may find useful and minimising their efforts in computing such data on their own, especially if they do not have access to high performance computing facilities. Considering the growing coverage of the dataset, we believe that our work may contribute towards having more studies that include more than one city, and cover cities that have previously not been the subject of studies, as our literature review (Section 3.3) reveals that cross-city studies are still not common and that large swaths of land are not the subject of research. The examples in Section 5 present just a fraction of the possibilities. Further, we believe the same examples present a scientific contribution as some analyses have not been conducted before. Thanks to the administrative tags, bridges to other data can be established, with the work leading to new explorations and revealing previously unexplored relationships with a variety of indicators, some of which are novel.
In conclusion, one of the principal contributions of GBMI is that it presents a method for formalised, structured, modular, and extensible computation and management of urban indicators at a massive scale and high resolution, while the precomputed dataset allows easy and fast comparative studies in a variety of software, which is an advancement with respect to related work.
Our work is designed to carry on as a continuous development, primarily continuing the computations for further locations.

Fig. 1. Illustration of independent indicators at the building level.

Fig. 2. Illustration of contextual aspects computed at the building level based on its surroundings.

The code and dataset are released openly. Both contributions are supported with extensive documentation.

Fig. 5. Example of computed indicators at different levels for a part of London. Sources of building and administrative data: (c) OpenStreetMap contributors, GADM.

Fig. 6. Site coverage distribution across selected cities around the world.

Standard deviation (SD), index of dispersion (D), and coefficient of variation (CV) are not necessarily always perfectly correlated, affirming the importance of computing all three of them. The importance of such multiplicity is also asserted by others, e.g. Wang, Geoffroy, and Bonhomme (2021) use multiple descriptive statistics of the same indicator.

Fig. 8. Correlation between a selected set of indicators computed at the municipality level in Switzerland (level 3 in Fig. 7) and a set of demographic data available as open data by the federal government. Source of statistical data: Swiss Federal Statistical Office, Regional portraits 2021: key data of all communes.

Fig. 9. A visualisation of multivariate data extracted from the dataset of New Zealand. Legend of the indicators (clockwise from top): width of the minimum bounding rectangle (mean), length of the minimum bounding rectangle (coefficient of variation and mean), azimuth (mean), equivalent rectangular index (mean), compactness (mean), and complexity (mean).

Fig. 10. A map visualising the spatial pattern of the minimum distance among buildings in high-resolution tiles across Singapore. The photographs are courtesy of Unsplash contributors.

Fig. 11. Distribution of complexities of building footprints by city.

Fig. 12. The relationship between the complexity of buildings and their normalised number in a zone. These two aggregated indicators are usually negatively correlated, with some exceptions.

Fig. 13. The relationship between three summary indicators suggesting the dispersion of a particular indicator (complexity).

Fig. 14. Various urban form metrics, when aggregated at city level, enable us to understand the flavours of cities and facilitate comparisons.

Table 2
Urban form measures at the aggregated level, derived from the indicators of the buildings in the corresponding area.