Building(s and) Cities: Delineating Urban Areas with a Machine Learning Algorithm

This paper proposes a novel methodology for delineating urban areas based on a machine learning algorithm that groups buildings within portions of space of sufficient density. To do so, we use the precise geolocation of all 12 million buildings in Spain. We exploit building heights to create a new dimension for urban areas, namely, the vertical land, which provides a more accurate measure of their size. To better understand their internal structure and to illustrate an additional use for our algorithm, we also identify employment centers within the delineated urban areas. We test the robustness of our method and compare our urban areas to other delineations obtained using administrative borders and commuting-based patterns. We show that: 1) our urban areas are more similar to the commuting-based delineations than the administrative boundaries but that they are more precisely measured; 2) when analyzing the urban areas' size distribution, Zipf's law appears to hold for their population, surface and vertical land; and 3) the impact of transportation improvements on the size of the urban areas is not underestimated.


Introduction
Understanding city size and why cities grow are two issues that have attracted the growing interest of researchers in recent decades (Duranton and Puga, 2014). However, in their work one of the main challenges urban economists face, together with the scarcity of data, is just how a city should be defined. Until recently, most available data were provided at the local administrative or local political unit level; yet, using these data has proved problematic: First, because the land size and land use of these units are diverse and the population and economic activity within them are not equally distributed (presenting a mix of both rural and urban land) and, second, because cities can grow beyond their borders spreading into the surrounding area.
Given these circumstances, a city definition based on economic characteristics makes little sense.
Unfortunately, administrative areas are often used for policy-making purposes but, here again, such areas do not usually reflect any functional reality and may even compromise the effectiveness of resulting policies (Briant et al., 2010). For these motives, an ability to define urban areas more accurately should aid analyses of the heterogeneous nature of policy impacts within the same administrative/political boundaries and of spillovers across functional/economic areas. It should also be helpful in examining the sensitivity of policy evaluation to how economic areas of interest are defined.
We contribute to the literature by developing a new methodology for delineating urban areas.
By drawing on a unique database on the precise geolocation of all 12 million buildings in Spain, we design a density-based machine learning algorithm to group buildings within portions of space of sufficient density. In line with Rozenfeld et al. (2011) and Bellefon et al. (2019), our objective is to delineate urban areas following a bottom-up approach. These two papers both define cities through the aggregation of cells based on a density criterion; however, here, we do not rely on micro-aggregations to define the boundaries but use the location of each of the buildings in Spanish territory as our first source of information. One of the improvements provided by our method is that our algorithm uses only 10% of the entire sample at a time (with 1,000 replications) and then extrapolates the structure captured in that subsample to the rest of the dataset. To ensure that the urban areas delineated in this fashion are sufficiently robust, we consider that the buildings belong to an urban area if they are assigned to that urban area in 90% of these replications. We run different tests to provide evidence of the stability of our algorithm and the result of our method is the delineation of 717 urban areas accounting for 75% of the population and occupying less than 5% of the whole territory.
Our dataset provides additional information that we are able to exploit so as to better characterize Spain's urban areas. First, we use information on building heights (measured as the number of floors). Recent papers by Ahlfeldt and McMillen (2018), Brueckner et al. (2017) and Liu et al. (2018) highlight the importance of taking this measure into account to understand the shape of cities and the impact the height of buildings can have on land and housing values.
Here, we calculate the vertical land of the delineated urban areas as the footprint of the buildings (horizontal land) multiplied by the number of floors. Our results indicate that the vertical land multiplies by three the amount of developed land. This analysis of the vertical land provides a different perspective on city size (especially in the case of the a country's largest urban areas).
Second, we dispose of information on the use of the buildings (residential vs non-residential) and the methodology we adopt also allows us to define the employment centers within our delineated urban areas. Thus, we can identify 2,056 employment centers, representing 63% of the total vertical land. However, only 70 centers house more than 10,000 jobs and just seven are home to more than 50,000 jobs. These results are in line with the well-documented evidence that economic activity within urban areas is markedly more concentrated and presents different patterns of location to those presented by residential areas. This exercise highlights an additional use of our algorithm, namely, it provides a better understanding of a city's internal structure.
Various methodologies have been developed that define urban areas as a collection of smaller units. A common approach in this regard relies on commuting patterns. Here, as long as population mobility plays a key role both in an economic system's performance and in the daily life of individuals, the journey-to-work relationship between two areas allows researchers to determine whether they belong to the same local labor market and, hence, if they can be considered to form part of the same urban area (see Duranton, 2015, for a review). However, because of the lack of commuting data for some developing countries, an increasing number of papers in recent years have opted to use information on the distance between lights in nighttime satellite images to delineate urban areas. Henderson et al. (2018) provide an overview of the applications of night light data in economics and Dingel et al. (2019) use such data to define metropolitan areas. Similarly, instead of using night light data a number of studies employ land cover data. This information, provided by NASA and, more recently, by the European Space Agency (ESA), among others is also available at the global scale. Examples of this approach include Chowdhury et al. (2018), who use such data to estimate urban areas, and Baragwanath et al. (2019), who use them to define urban markets. Finally, recent developments in communication technologies have facilitated studies of how people use space in cities, providing an important new tool for urban research, especially for areas where data are scarce or simply not available. The work of Louail et al. (2014) and Büchel and von Ehrlich (2019) are good examples of how cell phone data records can be used to understand the spatial structure of cities.
The delineation method proposed here offers several advantages. First, it does not attempt to aggregate administrative units. Second, it only takes into consideration areas that have been developed (i.e. buildings), ignoring undeveloped regions of the territory. Third, detailed information about the buildings allows us to characterize more accurately the structure of the city in terms of its verticality and the location of residential and non-residential activities. Fourth, our approach is more robust than other methodologies and allows to explore the stability of a boundary because it relies on the computation of several candidate solutions that we then combine to arrive at our preferred solution. And, finally, if the appropriate information is available, our algorithm can be replicated for other countries for two reasons: a) it is computationally scalable to large datasets and b) buildings are homogenous units across different countries.
The rest of this paper is organized in six sections and three appendices. In Section 2 we describe the data. Section 3 explains the methodology employed to delineate the urban areas.
Section 4 presents the results. In Section 5 we compare our delineated urban areas with other delineations. Finally, in Section 6, we highlight the most important findings and draw our final conclusions. Appendix A includes the technical details of our algorithm; Appendix B reports some robustness checks of the algorithm; and, Appendix C shows summary statistics.

Data
Our dataset, provided by the Spanish Cadaster (Dirección General del Cadastro), comprises a unique three-dimensional description of all buildings in Spain 1 , geolocated with metric precision for the year 2017. The Cadaster is an administrative registry, supervised by the Ministry of Finance, that contains a description of all real estate data (urban and rural). In other words, the Cadaster constitutes a record of the physical, legal and economic characteristics of all the properties in the country, with one of its main uses being to provide accurate information for the tax system. For example, one of the main local taxes in Spain, the property tax, is dependent on the information contained in this register. In this regard, it should be noted that the law holds that the registration of every property is mandatory and free of charge. This rule guarantees that the data cover the universe of buildings. All unprotected data referring to each property, identified by its cadastral reference, can be downloaded from the Cadaster at http://www.sedecatastro.gob.es. These data include all information about the building except that referring to its ownership and value. For each building, this url provides access to an online form that provides basic information and which can be downloaded in PDF format. It also gives access to a detailed map of the building that can be downloaded in GIS format and detailed information about such characteristics as: 1) the building's exact location and total built surface, 2) the year of construction, 3) its use (residential or non-residential), 4) height (number of floors above ground), 5) its footprint (m 2 ), and 6) the number of total units that are contained in each building and, specifically, the number of residential units (dwellings). By way of illustration, Figure 1 shows the online form and footprint map of Antoni Gaudí's well-known building, 'La Pedrera', in Barcelona. The online form indicates that it was built between 1906 and 1912, and records its postal address, footprint (1.808 km 2 ) and main use (residential).

Delineating urban areas with buildings and a machine learning algorithm
We delineate urban areas as portions of land with a minimum, uninterrupted level of building density. To do so, we develop a novel approach as an extension of a well-understood machine learning algorithm (DBSCAN; Ester et al., 1996) that we name 'Approximate DBSCAN' (A-DBSCAN) 2 . Its purpose is to detect robust clusters of buildings that reach a minimum density threshold. To achieve this, our algorithm requires two input parameters: first, the minimum number of buildings that each urban area (cluster) needs to include to be considered so; and, second, a maximum search distance in which to count surrounding buildings to check whether the first criterion is satisfied. Once a set of buildings is identified as a cluster, our method draws its surrounding boundary using the α-shape algorithm Edelsbrunner et al. (1983), a widely used approach to delineate tight bounding boxes. evaluates the minimum number criterion ( Figure 4b). In this case, the criterion is satisfied and this building is labelled as a 'core' point. The algorithm continues to run by drawing circles around the other points and evaluating the minimum number criterion. Figure 4c shows all the buildings that satisfy the minimum number criterion and which are core points. All these buildings/points are reachable, that is, there is a direct connection from one building to another or an indirect link via paths that cross through other core buildings. Figure 4d shows other buildings (in blue) that do not satisfy the minimum number criterion but which are reachable from some core buildings (i.e. they are within the core building circles).
These are the so-called 'border' points and they also belong to the delineated urban area. Finally, Figure 4e shows a building (in green) which, after drawing the circle with a radius equal to the distance threshold, does not satisfy the minimum number criterion. This type of building/point is the so-called 'noise' point and does not belong to the delineated urban area.
The final delineated urban area ( Figure 4f) is made up of core and border buildings (red and blue points) but does not include any noise buildings (green points). By definition, the core of the delineated urban area has the highest density of buildings and the border area the lowest.
As a result, our definition of the urban area is in line with the traditional idea of a city, that is, a place with high levels of building density and with an urban spatial structure in which building density decreases towards its the boundaries.
When applying the algorithm to our building dataset, we set the two parameters -that is, the minimum number of buildings and the distance threshold -based on knowledge and evidence from the Spanish urban system. First, we set the minimum number to 2,000 buildings in order to ensure the urban areas we delineate house, at least, 5,000 people. On average, the Spanish household comprises 2.5 members (Instituto Nacional de Estadística, 2018). This minimum threshold is set assuming that the average building is a single family house, which is not exactly the case of some areas of Spain. However, we do not want to underestimate, or rule out altogether, the newer settlements built largely in accordance with that model of urban development. For the maximum distance threshold, our preferred results use 2,000 m. This parameter is chosen based on information about Spanish commuting patterns. The average daily distance commuted by a person in Spain's biggest cities is approximately 4 km (Cascajo et al., 2018) which, divided by two, yields 2 km per trip. As mentioned, our approach uses a machine learning algorithm, based on the original DB-SCAN developed by Ester et al. (1996). DBSCAN facilitates cluster identification based on measures of density without the need for auxiliary geographies. However, from a computational point of view, it does not scale well and, more importantly, it does not include any mechanism to ensure the robustness of the clusters (or, in our case, urban areas). To address this shortcoming, we specifically developed the A-DBSCAN extension. Thus, we propose turning the original algorithm into an ensemble that combines a number of exact DBSCAN runs (1,000 replications) on random subsamples (10% of the original dataset) that are expanded to the rest of the sample through a nearest-neighbor algorithm. These solutions are then summarized in one final set of urban areas (clusters) in which buildings are classified based on their most common occurrence.
That is, to ensure robustness, the buildings belonging to an urban area are those assigned to that urban area in at least 90% of the replications. Otherwise, these buildings are classified as noise points and do not belong to any urban area. A more detailed and technical explanation of We perform several experiments to explore the degree of agreement between our algorithm and the original DBSCAN. An ideal test in this context would be to compare the results of the two algorithms when applied to the entire dataset of Spanish buildings. However, this is not computationally feasible (indeed, one of the reasons for the development of A-DBSCAN is precisely to overcome this computational hurdle). Instead, we consider different parts of Spain characterized by varying numbers of buildings and population, urban areas of different size, and by different geographical features. We are then able to run both algorithms on these subsets and to compare their delineated urban areas. To do so, we use the 'adjusted Rand index' (Hubert and Arabie, 1985), a measure of similarity between two groups or classifications that is widely used in machine learning. In our case, we compare the set of delineated urban areas and the buildings that make up each area when using (1) our algorithm (with the 1,000 replications) and (2) the original DBSCAN. Mathematically, where a is the number of buildings that are assigned to the same urban areas in (1) and (2); b is the number of buildings that are assigned to different urban areas in (1) and (2); c is the number of buildings that are assigned to the same urban areas in (1) and to different urban areas in (2); d is the number of buildings that are in different urban areas in (1) and in the same urban areas in (2). Intuitively, a + b can be considered as the number of agreements (i.e., buildings assigned to the same urban areas in (1) and (2)) and c + d as the number of disagreements (i.e., buildings assigned to different urban areas in (1) and (2)). As a result, the Rand index measures the ratio of agreements between the two methods over the total number of buildings, and its values range between 0, dissimilarity, and 1, maximum similarity 3 .
The first three columns in Table 2 present the results of these comparisons for six different parts of Spain. Column 1 shows their geographical location; column 2 reports the size of their population, and column 3 presents the corresponding Rand index. In general, the degree of similarity between the delineated urban areas when using the two methods is quite high, with the maximum value being recorded for Sevilla (with a 97% degree of similarity) 4 . This is remarkable because our algorithm uses only 10% of the entire sample to calculate the exact DBSCAN (with 3 We use the adjusted version of this index that corrects for the probability of buildings being assigned to the same urban areas by chance. To do so, we use the implementation in the Python library scikit-learn (Pedregosa et al., 2011). 4 Additional computations for other randomly selected parts of the country show values that are always above 80%.
1,000 replications), and then extrapolates the structure captured in that subset to the rest of the dataset. Altogether, these results provide evidence of the efficiency and effectiveness of our algorithm. An additional advantage of our algorithm is its sampling approach, since it allows us to explore the stability of the delineations. In other words, given that each of the final delineated urban areas are based on 1,000 replications, it is possible to quantify the degree of agreement between the 1,000 delineations for each urban area. The last three columns in Table 2 explore this dimension for the largest urban areas (column 4) found in each of the aforementioned parts of Spain. In column 5 we draw the final boundaries of each urban area (the thicker red line) on top of the delineations of each replication (thinner black lines). The figures allow us not only to compare the overall stability between delineations, but also to identify areas within a given urban area of greater and lesser stability. For instance, while A Coruña's northern side displays a high degree of agreement, its southern border is more variable across replications, suggesting a more nuanced boundary. This approach also identifies borderline cases associated with more disperse developments that meet the requirements imposed by the algorithm in a large number of replications but not in enough to grant final assignation into the urban area. Zaragoza is a good example of this situation.
To summarize these visual displays, we compute a Stability index, which is based on the average difference between the delineated area in each single replication and that of the final delineated urban area. We express its value as a percentage of the final surface to correct for city size effects: where A r is the surface of the boundary obtained in replication r,Å is the surface of the final delineation, and R is the total number of replications. This measure captures the extent to which individual boundaries drawn as part of our method spatially overlap with the final boundary chosen. Each individual drawing of the boundary might be larger than the final one (as illustrated in the visualizations in Table 2) and include buildings that do not form part of the final delineated urban area. However, the difference between the two offers a measure of the stability of the final delineation and of the extent to which urban development is clearly delimited in the periphery of an urban area or, on the contrary, the degree to which it fades away progressively. The index has a lower bound of zero for the case of complete stability, when all replications agree exactly, and is not upper bounded (the difference between the final delineation and each replication can be arbitrarily large). Column 6 in Table 2 reports the Stability indexes for the selected urban areas.
In general, the values show high degree of stability in the delineations. Once again, the case of Sevilla stands out as it shows the closest value to zero and, as a result, the highest degree of stability between the delineations. In contrast, and as mentioned when discussing the boundaries (column 5), less stability is found in Zaragoza because of a disperse development that is not assigned to the final delineation of its urban area.
In summary, our algorithm for delineating urban areas has certain advantages over other methods. First, it is density-based and, in combination with our building dataset, identifies urban areas that only contain continuous parcels of space where the building density exceeds a minimum threshold. As discussed, this feature is in line with the traditional idea of a city that has come into existence because of the agglomeration economies created by the high concentration of population and firms. On the other hand, delineations based on, for example, commuting and/or administrative boundaries include large areas of undeveloped land, which reduces the overall city density. Second, our delineated urban areas are spatially continuous collections of buildings rather than exogenous aggregations, such as grid cells or administrative boundaries.
Such ex-ante groupings may be necessary when there is no information available about individual locations or even justified in some specific cases but, generally, they imply a loss of granularity.
Furthermore, they can potentially distort the final conclusions of analyses based on them: the so-called 'Modifiable Areal Unit Problem' (MAUP) (Openshaw, 1984, Briant et al., 2010. Third, our algorithm is robust to marginal changes in the data and has a "built-in" approach to explore solution stability. Furthermore, similar to bootstrap estimates, the proposed sampling approach ensures the results are computationally scalable and, thus, feasible in large datasets like ours (with more than 12 million buildings).

Main results
As can be seen in Table 3, our method delineates 717 urban areas that account for approximately 75% of the Spanish population. These areas contain 5.7 million buildings (roughly half of Spain's total), made up of 26 million units (72% of the total). The sum of building footprints (the horizontal area) is 1,596 km 2 . When we also take into account the buildings' floor area (the vertical area), this figure is multiplied by three (4,869 km 2 ). Interestingly, the average number of floors in the buildings in these urban areas is three. When considering the buildings' footprint together with the land lying between the buildings (streets, roads, parks, etc.), the total surface of the delineated urban areas reaches 22,469 km 2 , which represents nearly 5% of the surface area of the whole of Spain. The use of the buildings of these urban areas is mainly residential (83.1% of the cases). As for the population size of the delineated urban areas (Table 3 Panel B), ten of them have more than 500,000 inhabitants and represent one third of the Spanish population.
Together with the next 47 biggest urban areas (those with a population between 100,000 and 500,000 inhabitants), they represent 52% of the whole Spanish population. The data for these ten urban areas seem to indicate that the biggest ones in terms of population have larger surfaces and vertical lands. For the delineated urban areas, the correlation between their population and surface is 0.89, while that between their population and vertical land is even bigger (0.98). These results are obtained when the algorithm considers the number of buildings to be found within a 2,000-meter threshold. However, we also calculated the algorithm modifying this distance to see how it affects our results.  . 1,500 m), we obtain 773 urban areas (with 70% of the population). In contrast, as the distance increases, we obtain fewer urban areas containing more population. For example, when the distance was greatest (i.e. 2,500 m), 699 urban areas are delineated containing 79% of the total Spanish population. Table 4   Notes: Population is computed using population grid data (1×1 km cells within the boundaries of our urban areas) from the 2011 Population Census. Surface is total land of the delineated urban area. Horizontal area refers to the sum of building footprints. Vertical area is the sum of floor footprints. Floors refers to the average number of floors. Figure 5 shows the geographical location of the 717 urban areas (with colors ranging from green to yellow with increasing density of buildings within the urban area). As can be seen, most of the urban areas are concentrated along the Mediterranean coast, and in the center and south of Spain. The smaller scale illustration in Figure 6 shows the delineated urban areas in the region of Barcelona.
It is interesting to compare our results with those reported by Bellefon et al. (2019)

Identifying employment centers within the urban areas
An interesting exercise to illustrate a further application of our method is the identification of employment centers within each of the urban areas. The goal of this exercise is to focus specifically on the concentration of economic activity. With this purpose in mind, we adopt a similar approach to that described in Section 3, albeit with some differences. Most importantly, we focus on units rather than on buildings, given that a single building may include several units (especially if it contains more than one floor). This is a prevalent feature of CBDs and other forms of employment concentration where firms and workers cluster to benefit from density. Using units instead of buildings also allows us to account for the difference between vertically dense areas and those clustered only horizontally. This implies, in the case of this exercise, working only with the non-residential units.
Although the identification mechanism is similar to that used when delineating city boundaries, a few changes have to be introduced. First, instead of running a single instance of the algorithm for the entire dataset, we apply the method to each urban area delineated in the previous stage. Second, for each dataset of urban area buildings, we run A-DBSCAN using 50% of the sample. We retain 90% as the stability threshold and 1,000 replications to generate the delineations. Third, while each point still represents a single building, we now weight them based on the number of non-residential units that the building houses. Finally, we adapt our two algorithm parameters (minimum number of buildings per urban area and distance threshold) to identify employment centers in accordance with methods proposed in the literature. The most frequently used are those based on density thresholds (Giuliano and Small, 1991, McMillen and Smith, 2003, Giuliano et al., 2007, Muñiz et al., 2008 and density peaks (McMillen, 2001, Redfearn, 2007, Garcia-López et al., 2017b, where an employment center is a place whose primary feature is a high density of workers (and certainly one with a density higher than that of nearby locations). Giuliano and Small (1991) and McMillen and Smith (2003) define this density as 2,500 employees per km 2 . Assuming ten employees per non-residential unit (Fariñas and Huergo, 2016), the threshold we need to impose is 250 non-residential units per km 2 . By considering a distance threshold of 250 m, the minimum number of non-residential units is therefore 49 5 . Table 5 presents the results of the delineation of the employment centers within each of the urban areas. Panel A shows that the footprint of the employment centers amounts to 886 km 2 (that is, 55% of the horizontal land of all the urban areas). Interestingly, the economic activity inside the urban areas is clearly more concentrated and presents a distinct pattern of location to that of residential use. When we analyze the vertical land associated with these employment centers, the surface increases to 3,060 km 2 (that is, 63% of the vertical land of all the urban areas).
Thus, unsurprisingly, insofar as the average number of floors is 3.5 in the employment centers, the density of buildings in these areas is higher.
Panel B shows that, with no restriction on the size of the employment centers, the 717 urban areas contain 2,056 employment centers. However, when we establish a minimum number of jobs per center (so as to take the largest economic agglomerations into consideration), the number of employment centers falls. Thus, there are only 70 employment centers with more than 10,000 jobs and just seven centers with more than 50,000 jobs. In fact, only the biggest urban areas have more than one employment center with more than 10,000 jobs. This is the case, for example, of the cities of Barcelona (nine employment centers), Valencia (five), Madrid (four) and Málaga (three). This evidence is in line with the polycentric structure of these urban areas as reported and analyzed by (Garcia-López, 2010, 2012. To illustrate how the algorithm works to delineate employment centers at a smaller scale, Figure 7 shows the nine biggest delineated centers in the urban area of Barcelona.

Comparing our delineated urban areas
Comparing our urban area delineation results, obtained with a machine learning method based on the geolocation of the country's buildings, with previous delineations performed for Spain is far from straightforward. The main reason for this is that all previous methodologies have taken the municipality (i.e. the political administrative local entity) as their starting unit of analysis.
In such studies, urban areas were built by aggregating surrounding municipalities to a central one. Spain has 8,131 municipalities, most of them quite small (in fact, 90% have fewer than 5,000 inhabitants) and with considerable variation in terms of their surface area. This suggests that these methodologies are likely to be much less precise and that their results cannot be treated at the same scale as ours. Despite this, it is nevertheless interesting to compare our results with those obtained using these different methodologies.
In recent years, there have been a few attempts to aggregate municipalities into urban areas.
The limitation of this particular exercise was that the initial population threshold imposed by the authors was 100,000 inhabitants and for this reason they identified just 41 local labor markets.
As discussed, comparing our urban area delineations with existing ones is conceptually challenging but empirically possible. By way of an initial approach, Panel B in Table 6 ranks the 10 largest urban areas delineated by the AUDES methodology that we consider to have the fewest limitations, as well as being the one for which we can access most data. We include our results in Panel A and, in Panel C, the outcomes corresponding to the municipalities' administrative borders, which we consider informative. For each urban area and method, we show the population (in thousands of inhabitants), the surface (in km 2 ) and the vertical area (in km 2 ).  On average, a comparison of our delineations with those performed by AUDES shows that our method delineates cities that contain less population (up to 15% less). Likewise, the pattern that emerges when considering area as opposed to population is similar, and if anything slightly more restricted boundaries than those identified by AUDES (our urban areas have a surface area that is one third less). Indeed, proportionally, these differences are larger than when considering population. This outcome is probably the consequence of our working with buildings instead of basing the aggregation on the municipal units. However, it is interesting to see how this approach affects some of the biggest urban areas. For example, while the population assigned to Madrid's urban area by AUDES is 40% greater than that delineated by our algorithm (6.1 vs 4.5M, respectively), the area within the city's boundary is almost seven times larger for AUDES than for our delineation (4,124 vs 691 km 2 , respectively). In the case of the urban area of Barcelona, the outcome is quite distinct. Both methods delineate an area of similar population (around 4.3M inhabitants) but the surface assigned by AUDES is 26% greater than that assigned when using our method (1,506 vs 1,191 km 2 ). Interestingly, the surface of the administrative area of Barcelona is just 99 km 2 . In contrast, the urban areas of Zaragoza and Murcia are especially extensive according to the administrative delineation of their borders (974 and 886 km 2 , respectively), but their surface sizes are much smaller according to our delineation method (88 and 427 km 2 , respectively). As for the vertical land of the urban areas, even this is greater, on average, according to the AUDES delineation, although the difference between the two methods is smaller than that for their respective surface areas.
Because our method takes buildings as its basic unit of analysis and does not impose any ancillary geography to calculate densities, the boundaries it generates do not include any low density spaces, which abound in the peripheries of Spain's municipalities. However, it should be borne in mind that these conclusions are based on a small subset of cities. To provide confirmation, we would need to expand the analysis to the entire set of delineations. In the sections that follow, we report different exercises aimed at comparing more accurately the outcomes of the different delineation methods.

Rand index and overlapping
In this section, we employ the adjusted Rand index (see Section 3) to compare our delineated urban areas with those of the AUDES project and the Spanish municipalities. As discussed, the index developed by Hubert and Arabie (1985) measures the ratio of agreements (buildings assigned to the same urban area in two delineations) over the total number of buildings, and its values range from 0, dissimilarity or no overlap, to 1, maximum similarity or complete overlap.
The ratio between our algorithm and AUDES is 0.350, while that with the municipalities is 0.003, indicating that the former provides the closest definition to our own delineations, while the municipalities is least similar 8 .
Another way to compare the three delineations is simply by analyzing the extent to which, and just where exactly, the alternative delineations (AUDES and municipalities) are included within the boundaries defined by our algorithm. Figure 8 presents two histograms showing the number of AUDES urban areas and Spanish municipalities included within our A-DBSCAN boundaries (Figures 8a and 8c, respectively) and two maps showing their location (Figures 8b   and 8d, respectively). In line with the evidence presented in the paragraph above, A-DBSCAN delineations coincide much more closely with the AUDES boundaries than they do with the municipalities. This is to be expected, given that AUDES are groups of municipalities linked by their common geography and commuting flows. However, the figure provides evidence that our method is able to approximate these same boundaries using a quite distinct approach.
Geographically, our maps also reveal a number of clear patterns. Our delineated urban areas containing parts of more than two AUDES are disproportionately located in the Mediterranean coast. We interpret this in terms of the type of urban development present in this region compared to that in the rest of Spain. The Mediterranean region is much more developed than the rest of the country. The density of urban development is also much higher, as can be discerned from Figure 2. This pattern results in A-DBSCAN identifying larger contiguous areas in which the building density is above the threshold required for an area to be considered urban. In turn, this makes our definitions of Barcelona, Valencia, or Alicante, among others, larger than their AUDES counterparts. In contrast, in the center of the country, a much sparser region with well delimited towns and cities, most urban areas delineated by A-DBSCAN contain only one AUDES. (c) (d)

City size distribution
Zipf's law suggests that a country's city size distribution can be approximated with a Pareto distribution with shape parameter equal to one. The higher (lower) the Pareto exponent, the more equally (unequally) distributed is the city system. The power law implies that, in a system of cities, the largest city is roughly twice the size of the second largest city, about three times the size of the third largest city, and so on. Indeed, since the seminal work of Gabaix (1999) and Eeckhout (2004), an enormous amount of city size distribution literature has been published (see Nitsch, 2005 andArshad et al., 2018 for a review of this literature).
The evidence reported by this literature is mixed. In some cases the law holds precisely but, in others, the outcome lies some distance from the unit parameter. The variety of results seems to be attributable to the city definition employed and, as a consequence, to the heterogeneity in the city samples used to perform the tests. Given this situation, it is interesting to compare the city size distribution by simulating an exercise performed by Rozenfeld et al. (2011) in which different definitions of city within the same country are taken into consideration. Thus, in the following paragraphs, we seek to determine whether the city size distribution in Spain depends on the definition of the units of analysis. Figure 9 plots the log-ranks against the log-sizes for the urban area boundaries created by: 1) our algorithm A-DBSCAN, 2) AUDES commuting-based patterns, and 3) the administrative municipalities. Panel A shows the plots for the three delineations using population as the measure of each urban area's size 9 . The estimated Pareto exponent is negative and very close to 1 for both our delineated urban areas and those from the AUDES project (-0.97 and -0.99, respectively). As discussed above, if the estimated value of the Pareto exponent is equal to one, then Zipf's law is confirmed as holding exactly for these two delineations. In contrast, the estimated parameter for the administrative municipalities is -1.23, indicating that Zipf's law does not fit in this case and also that the distribution of the population across these units is more unequal. However, the relationship does fit the log-linear specification quite well in all three cases (with an R 2 of 0.99, 0.99 and 0.88, respectively). Interestingly, in all three delineations the slope fits very well in the upper tail of the distribution. In both cases, the slopes show a log-quadratic, as opposed to a log-linear, relationship between surface-rank size. When the size of the urban areas is measured in terms of the vertical land (Panel C), the estimated coefficients for both our urban delineated areas and the AUDES areas are very close to -1 (-1.04 and -1.09, respectively). In the case of the municipalities, whose distribution presents a clear concave shape, the coefficient is -0.67 clearly indicating that Zipf's law does not hold for this delineation. A detailed inspection of the shape of the distribution of the AUDES urban areas shows that the biggest city in terms of vertical land is a clear outlier. This is not the case for our delineated urban areas, which present a more continuous pattern.  (2011), we substract 0.5 from the rank to improve the estimation.

City size and transportation
As has been well documented, one of the main problems urban economists face is the so-called 'Modifiable Areal Unit Problem' (MAUP) (Openshaw, 1984). As Briant et al. (2010) have shown, empirical results can change when different spatial units are adopted. In this subsection, we compare our delineated urban areas with AUDES urban areas and Spanish municipalities by studying how city size relates to transportation 10 .
Based on empirical studies analyzing the impact of transportation on city size measured in terms of population (Duranton and Turner, 2012, Garcia-López et al., 2015, Garcia-López et al., 2018, Baum-Snow et al., 2019 and land area (Brueckner and Fansler, 1983, Garcia-López, 2019), we can estimate the following equation: We consider three dependent variables to measure city size. First, in line with tradition, we consider the size of the city in terms of population using the log of the 2010 population (inhabitants). Second, we take into account the physical size of the city's surface, that is, its size in terms of land area (horizontal dimension) with the log of city surface (km 2 ). Finally, in line with recent studies by Ahlfeldt and McMillen (2018), Brueckner et al. (2017) and Liu et al. (2018), we also study the vertical dimension of the city by using the log of vertical land area (km 2 ).
Our main explanatory variables are related to transportation. Specifically, we consider the 2010 log of the length of the highway network (km), and the 2010 log of the length of the railroad network (km). To compute these, we use GIS maps of the road system and the railroad network in Spain that form part of the Büro für Raumforschung, Raumplanung und Geoinformation (RRG) GIS Database. The related empirical literature (see the survey by Duranton and Puga, 2015) considers these variables as proxies for transportation costs (and road congestion).
Since the different versions of the monocentric model show that socioeconomic, geographical and historical characteristics also shape city size, we include controls related to these features. In Appendix C we discuss in detail the different control variables and we report summary statistics for all variables using cities from the three delineation methods (Table C.1).  Notes: Robust standard errors in parentheses. Socioeconomic variables are the log of the average income and the share of population with a college degree. Geographical variables are altitude, the terrain ruggedness index, land area overlying aquifers, the length of rivers, annual median and range of precipitation and median and range of temperatures for winter and summer months. Historical variables are dummy variables for cities that were Roman settlements and Medieval major towns. a , b and c indicate significant at 1, 5, and 10 percent level, respectively.
The results for our delineated urban areas show positive and significant relationships between population and highways and railroads (column 1): The larger both transportation networks, the larger the size of the city in terms of population. Results for AUDES (column 2) and municipalities (column 3) are in the same direction but present some notable differences. First, the estimated coefficients are significantly smaller for these two alternative definitions. Second, the effect of railroads is not significant in the AUDES regression.
When we consider the size of the city in terms of land surface (columns 4, 5 and 6), we obtain similar results: Positive relationships showing that cities with larger highway and railroad systems tend to have larger land areas. However, in the case of our delineation (column 4), the estimated coefficient is significantly larger than those corresponding to AUDES (column 5) and municipalities (column 6). Furthermore, the effect of railroads is again not significant in the AUDES sample.
The results for the vertical size of the city (columns 6 to 9) are in line with the above outcomes.
In general, they show that larger highway and railroad systems are positively related to larger vertical developments. The largest effects are found for our delineated urban areas. Once more the effect of railroads is not significant in the AUDES regression.
In short, these results show that larger transportation networks can be related to larger cities in terms of their population, surface and vertical land. Our preferred results are those related to our delineated urban areas: First, because they are in line with the theory on urban spatial structure (Alonso, 1964, Mills, 1967, Muth, 1969, Brueckner, 1987, Duranton and Puga, 2015 and, second, because the other two delineations, AUDES and municipalities, seem to underestimate the effects.

Conclusions
Empirical research in Urban Economics has to address the challenge of identifying the best geographical unit of analysis for measuring what constitutes a city. But all too often information is provided solely for a city's local administrative/political units, even though there might be a clear consensus that such an approach fails to capture its real scope. To solve this problem, in recent years, and drawing on increasingly more sophisticated data, various methodologies have been developed to delineate urban areas.
In the paper, we present a new method for delineating urban areas based on very precise geolocated data for all the buildings in Spain. Using machine learning tools, we design and calculate a distance-based clustering algorithm that defines 717 urban areas containing three quarters of the whole population in less than 5% of the territory. Detailed information about the buildings allows us to better characterize the structure of the city in terms of its verticality and the location of its residential and non-residential activities. The algorithm can also be used to delineate the employment centers within these urban areas. When comparing our delineated urban areas with other delineations, we find that our urban areas are better measured, being more similar in this regard to commuting-based delineations than to areas delimited by administrative boundaries.
Our delineations are superior to those obtained using these other two methodologies because we do not include large areas of undeveloped land, which serves only to reduce the city's overall density. Because our delineated urban areas are spatially continuous collections of buildings rather than exogenous aggregations, such as grid cells or administrative boundaries, we believe that they better reflect the idea of an urban agglomeration based on a high concentration of inhabitants and firms. Technically, it should be stressed that our algorithm is robust to marginal changes in the data and that our results are computationally scalable in large datasets with millions of observations. Thus, one of the main advantages of our method is that, with the appropriate information, it can be replicated for other countries.
The use of our delineated urban areas as a unit of analysis for urban research is feasible when statistical information is available in a geocoded format. But, today, some information continues to be provided at the administrative level. Moreover, in some research fields, including Political Economy, Public Economics and Public Policy Evaluation, it is important to maintain the administrative borders in the analysis given that they continue to delimit the political-decision unit. However, in both instances, and with some simple adjustments, our urban areas can also be used. It remains our contention that a better definition of just what constitutes a city would improve the results of empirical analyses in Urban Economics and serve to guide policy makers when taking decisions that need to take into account the precise scope of the urban area.
2. Run the original DBSCAN algorithm on one of the subsamples to obtain a set of cluster labels for each point in that subset.
3. For each point in the second subset, assign the label of the nearest point in the first subset.

4.
Save the entire set of labels as a single candidate solution.
6. Align labels across candidate solutions so a given label represents the same cluster in each of the replications. This can be done in the following way: (a) Set the solution with most clusters as reference.
(b) For each cluster in the remaining solutions, find the nearest cluster in the reference and assign its label. In this context, nearest can be expressed as the shortest distance between the centroids of the two clusters.
7. For each point in the dataset, obtain the most common label and the proportion of times across candidate solutions where that label is assigned.
8. If the most common label appears a proportion of times that is smaller than a desired threshold (e.g. 90%), label the observation as noise; otherwise assign the most common label as the label for that point.
Since the core "clustering engine" of A-DBSCAN is DBSCAN, our extension carries with it all of the original benefits, including the independence of ancillary geographies and the densitybased criterion. In addition, since A-DBSCAN only requires to run DBSCAN on a potentially small fraction of the data, the approach is much more scalable. For A-DBSCAN to work, the original DBSCAN algorithm needs to be applied to a subset that is large enough to capture the overall spatial structure of the point pattern represented in the initial dataset. Furthermore, our approach is also more robust to outliers and thus more likely to accurately capture the underlying clustering process. The final label a point receives is not the result of a single run, but includes information based on several independent runs. A-DBSCAN exploits the assumption that the initial dataset is large to its advantage, treating it as a "pool" of replications from the underlying data generating process (DGP), that can be used to construct not only one but several observed patterns, with less observations but with the same properties and spatial structure. By considering several random subsets that come from the same DGP, we obtain empirical distributions for each observation and each label. This approach allows us to evaluate the uncertainty behind each assignment and to only label as part of a cluster those cases where we have sufficient evidence.
This is not possible in the traditional approach because only one assignment is carried out. The resulting label each point is assigned is thus robust to outliers, corner solutions, and other forms of noise. for i in D do:

12:
Obtain the most common label cl i and its frequency across R, pct cl

13:
if pct cl < thr, where thr is a set threshold then label cl as noise for r in R; r / ∈ L re f do: 18: for cluster cl r in r do

19:
Assign to cl r the label of the nearest cluster centroid in L re f  We control for socioeconomic characteristics by including the 2007 log of the median income (e) (based on the municipal estimates by Hortas-Rico and Onrubia (2016)). Using the 2011 Population Census grid data (1×1 km), we compute the share of population with a college degree.

Appendix C. Control variables and summary statistics
Following Burchfield et al. (2006), we control for geography. First, using data from Spain's Digital Elevation Model we compute the altitude (km) and the terrain ruggedness index developed by Riley et al. (1999). Second, we compute the land area overlying aquifers (km 2 ) and the log of the length of the rivers (km) crossing each city. Finally, using theÁtlas Climático Digital de la Península Ibérica, we compute the log of the annual median precipitation (mm) and the log of the annual range precipitation (mm). Similarly, we include the log of the median temperature ( • C) and the log of the range temperature ( • C) for December, January, February and March (winter), and for June, July, August and September (summer).