MTBF-33: A multi-temporal building footprint dataset for 33 counties in the United States (1900 – 2015)

Despite abundant data on the spatial distribution of contemporary human settlements, historical datasets on the long-term evolution of human settlements at fine spatial and temporal granularity are scarce, limiting our quantitative understanding of long-term changes of built-up areas. This is because commonly used large-scale mapping methods (e.g., computer vision) and suitable data sources (i.e., aerial imagery, remote sensing data, LiDAR data) have only been available in recent decades. However, alternative data sources such as cadastral records are digitally available and contain relevant information such as building construction dates, allowing for an approximate, digital reconstruction of past building distributions. We conducted a non-exhaustive search of open and publicly available data resources from administrative institutions in the United States and gathered, integrated, and harmonized cadastral parcel data, tax assessment data, and building footprint data for 33 counties, wherever building footprint geometries and building construction year information were available. The result of this effort is a unique dataset that we call the Multi-Temporal Building Footprint Dataset for 33 U.S. Counties (MTBF-33). MTBF-33 contains over 6.2 million building footprints including their construction year, and can be used to derive retrospective depictions of built-up areas from 1900 to 2015, at fine spatial and temporal grain. Moreover, MTBF-33 can be employed for data validation purposes, or to train statistical learning models aiming to extract historical information on human settlements from remote sensing data, historical maps, or similar data sources.

We identified U.S. counties or states that provide building footprint data and cadastral parcel data attributed with building construction year information. In a non-exhaustive search we identified 33 U.S. counties where these criteria were met. We integrated and harmonized these data to create geospatial vector datasets holding over 6

Value of the Data
• Open and publicly accessible data on building age are scarce. Our data scraping, integration, and harmonization effort aims to fill this gap in the data landscape.
• Knowledge of the construction year of individual buildings allows for creating spatially and temporally fine-grained depictions of past built-up surfaces.
• Such spatial-historical data may serve as calibration and validation data for urban change models and for historical (and more recent) human settlement datasets, as well as for historical population downscaling efforts.
• Moreover, such data are useful as auxiliary data for automatically generating training data for data-intensive (e.g., deep learning) computer vision models that extract urban change signals from remote sensing data or historical maps.
• These data enable historical analyses of the building stock in 33 U.S. counties, encompassing the whole state of Massachusetts, as well as several urban areas of different settlement age and characteristics, such as Boston, Charlotte, and Minneapolis.
• Lastly, these data are highly valuable for urban planners, remote sensing analysts, historians, demographers, and data scientists working in the context of urban land use change and (sub)urbanization, as they provide rare insight into the long-term dynamics of built-up areas at very high spatial and temporal detail.

Data Description
We collected open and publicly available data resources from the web from administrative, county- or state-level institutions in the United States and integrated and harmonized cadastral parcel data, tax assessment data, and building footprint data for 33 counties where building footprint data and building construction year information ("year built") were available. The result of this effort is a unique dataset called the Multi-Temporal Building Footprint Dataset for 33 U.S. Counties (MTBF-33 [1], available at http://dx.doi.org/10.17632/w33vbvjtdy). MTBF-33 contains over 6.2 million building footprints including their construction year, and is available as polygonal geospatial vector data in 33 ESRI Shapefiles, organized per county, in geographic coordinates (WGS84, EPSG:4326) as well as projected into the Albers equal area conic projection for the contiguous USA (USGS version, SR-ORG:7480). Fig. 1 shows small subsets of the MTBF-33 dataset for selected regions.
Moreover, Fig. 2 shows small subsets of the data for most of the 33 U.S. counties, illustrating the variability in the data and their coverage across different geographic settings.
As can be seen in Figs. 1 and 2, there are several buildings without a year built attribute (white color). We report the completeness of the year built attribute for each of the 33 counties in Table 1. Table 1 also shows some basic year built statistics, illustrating the variety in temporal coverage of the data.

1) Data creation
We manually collected cadastral parcel data, tax assessment data, and building footprint data from publicly and openly available web resources, such as state-level or county-level administrative GIS or spatial data resources. We used open data portals such as https://hub.arcgis.com to identify counties or states where (a) both parcel and building footprint data are available, and (b) parcel data or joinable tax assessment data contain information on the year when structures were established (year built). We identified 33 counties that satisfied these criteria and where the completeness of the building footprint data and the year built attribute was acceptable (see Table 1). In counties where the year built information was contained in separate tax assessment datasets, we first joined the tax assessment data to the parcel data. Then, we integrated the parcel data and building footprint data. This integration was done through a spatial join operation, in order to transfer the year built attribute from the parcel polygon features to the building footprint features contained within the cadastral parcel boundaries. This spatial assignment was based on a majority-area criterion in order to account for certain levels of spatial offset between parcel and building footprint data. Such offsets may exist due to different data acquisition methods: while parcel boundaries are typically measured using terrestrial or Global Navigation Satellite System (GNSS)-based land surveying technologies, building footprint data may be obtained through automatic segmentation of LiDAR data or by digitization in aerial imagery.
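The majority-area assignment described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses axis-aligned boxes as stand-ins for real parcel and footprint polygons (a real workflow would intersect polygons, e.g., with GeoPandas), and it assumes that "majority-area" means a footprint inherits the year built of the parcel that covers more than half of the footprint's area. The function names and sample values are hypothetical.

```python
from typing import Dict, Tuple

# Axis-aligned box (xmin, ymin, xmax, ymax); a simplified stand-in for a polygon.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    xmin, ymin, xmax, ymax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def assign_year_built(footprint: Box, parcels: Dict[str, Tuple[Box, int]]) -> int:
    """Return the year built of the parcel covering more than 50% of the
    footprint area, or 0 (missing) if no parcel holds a majority."""
    footprint_area = box_area(footprint)
    if footprint_area == 0:
        return 0
    for parcel_box, year_built in parcels.values():
        if overlap_area(footprint, parcel_box) / footprint_area > 0.5:
            return year_built
    return 0


# Hypothetical example: two adjacent parcels and a footprint that straddles
# their shared boundary because of a small spatial offset.
parcels = {
    "A": ((0.0, 0.0, 10.0, 10.0), 1925),
    "B": ((10.0, 0.0, 20.0, 10.0), 1978),
}
shifted_footprint = (7.0, 2.0, 11.0, 6.0)  # 75% of its area lies in parcel A
year = assign_year_built(shifted_footprint, parcels)
```

Here the shifted footprint is assigned the year built of parcel A, even though it is not fully contained in that parcel, which is the situation the majority-area criterion is meant to handle.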
As a result of these spatial joins, the year built attribute was transferred to the building footprint features. For these processes, we used the GeoPandas and ESRI ArcPy Python packages. We then harmonized and cleaned the data. This cleaning process involved the identification of non-plausible year built values (e.g., < 1500). We did this to remove structural zero values representing missing information, as well as obviously incorrect values that likely result from typos, such as "910", which could be "1910". The threshold of 1500 was chosen because it marks the approximate end of the Pre-Columbian era; very few built-up structures or dwellings from that era are still intact, and those that are tend to be considered monuments or landmarks rather than "buildings" as defined in modern building stock databases. Such missing or non-plausible year built values were set to 0. Importantly, any property-, building-, or individual-level data other than the year built attribute were removed, so that the MTBF-33 data consist exclusively of building footprint geometries and their construction year. The resulting polygonal, geospatial vector data represent building footprints for 33 counties in the conterminous United States, allowing for the reconstruction of spatially and temporally fine-grained depictions of built-up surfaces (i.e., building level, annual temporal resolution).
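The cleaning rule described above (values below 1500 or missing are set to 0) can be sketched as follows. This is an illustrative sketch rather than the authors' code; the function name and the handling of non-numeric inputs are assumptions.

```python
MIN_PLAUSIBLE_YEAR = 1500  # threshold stated in the text
MISSING = 0                # code used for missing or non-plausible values


def clean_year_built(raw) -> int:
    """Map a raw year built entry to a plausible year, or 0 if missing/implausible.

    Assumption: entries that cannot be parsed as a number (empty strings,
    None, NaN, free text) are treated as missing.
    """
    try:
        year = int(float(str(raw).strip()))
    except (TypeError, ValueError, OverflowError):
        return MISSING
    return year if year >= MIN_PLAUSIBLE_YEAR else MISSING


# Hypothetical raw values as they might appear in assessor records.
raw_values = ["1910", "910", 0, None, "", 1987.0, "unknown"]
cleaned = [clean_year_built(value) for value in raw_values]
```

Note that a likely typo such as "910" is set to 0 rather than corrected to 1910, consistent with the rule stated in the text.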
While contemporary building footprint data are available at high levels of accuracy [3], data on the historical distributions of the U.S. building stock are very scarce, in particular for time periods earlier than the 1970s or 1980s, when remote-sensing-based, digital earth observation data became accessible. Thus, despite representing only about 1% of the U.S. counties, this unique dataset covers more than 40,000 km² and more than 6,000,000 cadastral parcels, and fills an important gap in the geospatial data landscape. The MTBF-33 dataset was collected in 2016-2017. Since then, the authors have employed MTBF-33 for different purposes, including the validation of global remote-sensing-based multi-temporal built-up surface data [4][5][6][7][8], the validation of historical settlement data derived from property databases [9, 10], the automatic generation of training data for urban change detection based on Landsat time series data [11], the assessment of the sensitivity of Landsat time series to urban changes [12], and the generation of training data for computer vision models that extract settlement patterns from historical topographic maps [13, 14].

2) Validation and uncertainty assessment
The validation of data that entail advancements in quality and accuracy compared to existing data products is always challenging. We evaluated MTBF-33 through two approaches. First, we compared the dataset for all 33 counties with the Microsoft Building Footprint (MSBF) dataset [3] for the building footprints existing in 2015. Second, we carried out a visual comparison between the MTBF-33 and urban extents as found in historical topographic maps.

Agreement assessment between MTBF-33 and Microsoft building footprint data
Microsoft's building footprint dataset (data release from 2018) has US-wide coverage and has been extracted from Microsoft Bing imagery using a deep-learning-based computer vision method. While the acquisition years of the Bing imagery are likely to vary across the United States, we assume that MSBF represents the U.S. building stock approximately in 2015, expecting an average period of three years between image data acquisition and building footprint data release. In order to evaluate MTBF-33 quantitatively, we created gridded binary layers (i.e., built-up versus not built-up) for 2015 for both MTBF-33 and MSBF, in a spatial grid of 250 m x 250 m, based on an area majority rule, for each of the 33 counties covered in MTBF-33. Based on these gridded surfaces, we established confusion matrices per county and used them to calculate various measures of agreement between the two binary layers, using the MSBF data as reference data (Table 2). While some of these measures have been criticized for their limitations, e.g., if class proportions are imbalanced [15, 16], a cross-section through all those measures represents a reliable assessment basis. We carried out the agreement assessment across all 33 counties, as well as separately for higher-density and lower-density counties. This stratification was based on the MSBF built-up surface density per county, using the median county-level built-up surface density as a threshold.
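As an illustration of the agreement assessment described above, the following sketch computes the four measures named in the text (Precision, Recall, F-measure, Kappa) from two binary grids flattened to lists of cells, with MTBF-33 as the test layer and MSBF as the reference layer. The function name and the toy data are hypothetical, and the exact set of measures reported in Table 2 may differ.

```python
from typing import Dict, Sequence


def agreement_measures(test: Sequence[int], reference: Sequence[int]) -> Dict[str, float]:
    """Agreement between two binary layers (1 = built-up, 0 = not built-up)."""
    if len(test) != len(reference):
        raise ValueError("Both layers must have the same number of grid cells.")

    # Confusion matrix entries, with the reference layer as ground truth.
    tp = sum(1 for t, r in zip(test, reference) if t == 1 and r == 1)
    fp = sum(1 for t, r in zip(test, reference) if t == 1 and r == 0)
    fn = sum(1 for t, r in zip(test, reference) if t == 0 and r == 1)
    tn = sum(1 for t, r in zip(test, reference) if t == 0 and r == 0)
    n = tp + fp + fn + tn

    precision = tp / (tp + fp) if (tp + fp) else 0.0  # 1 - commission error
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # 1 - omission error
    f_measure = (
        2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    )

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n if n else 0.0
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n) if n else 0.0
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "kappa": kappa}


# Toy example with eight grid cells (hypothetical values).
mtbf_cells = [1, 1, 1, 0, 0, 0, 1, 0]
msbf_cells = [1, 1, 0, 0, 0, 1, 1, 0]
measures = agreement_measures(mtbf_cells, msbf_cells)
```

In this formulation, cells that are built-up in MSBF but not in MTBF-33 count as false negatives and lower Recall, while cells that are built-up only in MTBF-33 count as false positives and lower Precision.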
As can be seen in Table 2, when using all 33 counties for map comparison, all accuracy measures show high agreement between the two layers (between 86.5% based on Kappa and 95.6% based on Precision). Higher accuracy is observed for high-density counties than for low-density counties. The notable difference in Recall (0.99 and 0.85, respectively) indicates that omission errors are higher in low-density counties, possibly because MSBF identifies structures that are not part of the county assessor's building stock database (e.g., barns), especially in more rural settings. MTBF-33 has no built structures at those locations, resulting in higher proportions of false negatives. This effect propagates into the other measures, resulting in a reduced F-measure and Kappa index for lower-density counties. However, even in lower-density counties, no accuracy measure is less than 0.83, indicating high levels of agreement between the two datasets. Moreover, there may be slight temporal gaps between MTBF-33 (representing the building stock in 2015) and MSBF (heterogeneous acquisition years of the imagery underlying the MSBF data), due to the heterogeneous levels of temporal coverage of the MTBF-33 data (see built year statistics in Table 1), and due to the vagueness in the definition of the construction year in MTBF-33. These factors could explain some of the slight disagreement observed in Table 2. Here, it is worth noting that the assumed period of three years between Bing imagery acquisition and data release in 2018 may be overly optimistic, i.e., the underlying imagery may have been acquired prior to 2015, and thus MSBF may reflect an earlier state of the building stock. Assuming predominant growth (rather than shrinkage) of the building stock over time, such an incorrect time stamp of our reference data (i.e., MSBF) would result in artificially inflated commission errors (i.e., low Precision) in our test data (i.e., MTBF-33), which represents a later state of the building stock. However, as shown in Table 2, Precision is very high across all strata. Thus, the unknown temporal reference of the MSBF data and the effects of a potential temporal gap may explain the observed commission errors of 0.040 to 0.046.

Qualitative evaluation of MTBF-33 data against historical maps
We carried out a visual comparison between MTBF-33 and urbanized extents as shown in historical topographic maps of the U.S. Geological Survey (USGS) historical topographic map collection (HTMC) [17]. To do so, we created historical binary layers of MTBF-33 to match the publication dates of the historical map sheets. Fig. 3 shows historical map sheets for Boulder, Colorado, and the respective extracted MTBF-33 binary layers for 1904, 1957 and 1984. As can be seen, the spatial representations of built-up / urbanized areas generally agree. Agreement is highest within urban areas, which the maps show as denser building blocks along roads in 1904, in red in 1957, and in grey in 1984, even though MTBF-33 shows much finer spatial detail of the built-up areas. Outside the urbanized areas, the 1904 and 1957 maps show detailed building symbols along roads, many of which are also visible in the MTBF-33 layer. Some discrepancies can be seen in the northwestern part of the 1984 map, where some buildings are visible in MTBF-33 but not in the historical map. This is due to the level of cartographic generalization used in the 1984 map sheet (scale 1:100,000, whereas the 1904 and 1957 maps are at scale 1:62,500), which may not include individual building footprints outside urban extents.
Moreover, such discrepancies may be due to temporal uncertainty in the historical maps (i.e., the temporal gap between land surveying or field check and the map edition / publishing year) and to the potential uncertainty of the construction year information in MTBF-33. It is unknown whether the date on record reflects the beginning or the end of the building construction phase, and how long the construction phase lasted. Moreover, the construction year could be an estimate, and buildings may be missing in MTBF-33 because of incomplete records or missing built year information. However, the visual similarity for the two historical map sheets, combined with the quantitative agreement assessment against Microsoft's building footprint dataset, provides strong confidence that the built-up surfaces in MTBF-33 are plausible and accurate, with some local variations in completeness.
There are uncertainties in MTBF-33 that are very difficult to measure. For example, buildings that have been torn down or were destroyed are not included in the dataset. The respective parcels may have become vacant land, or a new structure may have been built. As a consequence, there is survivorship bias in the construction year information, which increases as we go back in time. Few studies report on or measure survivorship bias in settlement layers, as this requires access to historical versions of cadastral data or demolition records (e.g., [18][19][20][21]). For example, McShane et al. [21] used historical demolition data for Colorado and found that survivorship bias had limited impact on the resulting settlement layers, resulting in relative errors of less than 2%. Note that while we retain all plausible year built values from the scraped source data, we constrain the temporal coverage to the period 1900-2015. We discourage data users from using MTBF-33 to create snapshots of built-up surfaces preceding the year 1900, as the survivorship bias may be very large.
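A retrospective snapshot, such as the binary layers used for the map comparison above, can be derived by selecting all footprints built in or before a target year. The following is a minimal sketch under stated assumptions: footprints are reduced to hypothetical (identifier, year built) pairs, a building is treated as present from its recorded construction year onward, footprints with a year built of 0 (missing) are excluded, and target years outside 1900-2015 are rejected in line with the recommendation above.

```python
from typing import Iterable, List, Tuple

# (footprint identifier, year built); a year built of 0 encodes a missing value.
Footprint = Tuple[str, int]

FIRST_YEAR = 1900  # start of the recommended temporal coverage
LAST_YEAR = 2015   # end of the temporal coverage


def built_up_snapshot(footprints: Iterable[Footprint], year: int) -> List[str]:
    """Return identifiers of footprints that existed in the given year."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError(f"Snapshot year must be within {FIRST_YEAR}-{LAST_YEAR}.")
    # Buildings constructed before 1900 still count towards later snapshots;
    # only footprints without a year built (0) are left out.
    return [fid for fid, year_built in footprints if 0 < year_built <= year]


# Hypothetical footprints; the snapshot years follow the map sheets in Fig. 3.
buildings = [("a", 1895), ("b", 1904), ("c", 1957), ("d", 0), ("e", 1984)]
snapshot_1904 = built_up_snapshot(buildings, 1904)
snapshot_1957 = built_up_snapshot(buildings, 1957)
```

Because demolished buildings are absent from the source data, any such snapshot represents only the surviving building stock, and footprints without a year built cannot be placed in time.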

Ethics Statements
None.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.