National-scale mapping of building height using Sentinel-1 and Sentinel-2 time series

Urban areas and their vertical characteristics have a manifold and far-reaching impact on our environment. However, openly accessible information at high spatial resolution is still missing at large for complete countries or regions. In this study, we combined Sentinel-1A/B and Sentinel-2A/B time series to map building heights for entire Germany on a 10 m grid resolving built-up structures in rural and urban contexts. We utilized information from the spectral/polarization, temporal and spatial dimensions by combining band-wise temporal aggregation statistics with morphological metrics. We trained machine learning regression models with highly accurate building height information from several 3D building models. The novelty of this method lies in the very fine resolution yet large spatial extent to which it can be applied, as well as in the use of building shadows in optical imagery. Results indicate that both radar-only and optical-only models can be used to predict building height, but the synergistic combination of both data sources leads to superior results. When testing the model against independent datasets, very consistent performance was achieved (frequency-weighted RMSE of 2.9 m to 3.5 m), which suggests that the prediction of the most frequently occurring buildings was robust. The average building height varies considerably across Germany with lower buildings in Eastern and South-Eastern Germany and taller ones along the highly urbanized areas in Western Germany. We emphasize the straightforward applicability of this approach on the national scale. It mostly relies on freely available satellite imagery and open source software, which potentially permit frequent update cycles and cost-effective mapping that may be relevant for a plethora of different applications, e.g. physical analysis of structural features or mapping society's resource usage.


A1. Choice of machine learning method
Random Forest Regression (RFR) and Support Vector Regression (SVR) are two widely used machine learning methods employed in the field of remote sensing. Based on the Berlin dataset, we have evaluated which method to use for predicting building height. This assessment complements the preliminary analysis as presented in section 4.3.3.a. and 5.1 of the main paper. We inter-compared the performance of the radar features, the optical features, as well as a synergistic combination of both on the Berlin dataset. For this, we tested two machine learning methods: 1) We trained SVR models using 90% of the training data. The SVR hyper-parameters were tuned using grid search with 10-fold cross-validation. The remaining subsample was used for model inspection. For clarity of presentation, Figure 6 from the main paper is reproduced here as Figure A1.
2) Analogous, we trained RFR models with 500 trees of maximal depth, and used a third of all available features at each tree node to find the optimal splits ( Figure A2). The OLS estimates suggest the superiority of RFR as compared to SVR. However, the frequency-weighted RMSE indicates the opposite. When inspecting the map representation of the building height prediction, we observed a largely consistent behavior between the methods, except for large commercial buildings ( Figure A3).
We presume that this is because the buildings are larger than the radius used for generating the morphological metrics, and as such, the prediction can only rely on the spectral/backscatter signal of the roofing material (i.e. no building shadows nor trihedral scattering mechanisms are present). Obviously, the RFR substantially overestimated building height in these cases, whereas more reasonable heights were predicted by SVRalthough not highly accurate either. This is further corroborated by Figure A4. The mean absolute percentage error indicates that RFR is performing worse for specific building heights, e.g. for short buildings (4-5 m) as well as for 9-12 m tall buildings, which appears to be a typical height for commercial buildings. For tall buildings > 30 m however, RFR may be superior. However, due to the rare occurrence of these buildings in Germany, we opted to select SVR for producing the wall-to-wall building height map.

1) Methods a) Pre-selection
The large feature space generated may result in overfitting and computational complexity related to the "curse of dimensionality" (Rust 1997). As a consequence, we reduced dimensionality using a two-step feature reduction approach. In the first step, we aimed at quickly eliminating the least important features using a fairly fast training technique that provides a feature ranking. For this, we employed a Random Forest regression (RFR) on a random 10% subsample (500 trees, 1% of features tested at each split for fast computation) and retained all features that contribute more than 1‰ feature importance, i.e. we only removed features that did not contribute information at all.

b) Model selection
In the second step, we switched to our target machine learning algorithm, i.e. Support Vector Machine regression (SVR). Similarly to Schug et al. (2020), feature reduction was achieved by repeatedly training models using random feature subsets. We trained 100 SVR models, and randomly selected 50 pre-selected features as a reasonable compromise between testing many feature combinations and performing costly feature reduction strategies like forward, backward, or exhaustive feature selection. For each mode, the SVR hyper-parameters were tuned using grid search with 10-fold cross-validation. The training was performed on 90% of the training data.
The remaining subsample was used for model selection, wherein the model with best performance in terms of regression slope, offset, R², and RMSE was eventually selected.

a)
Pre-selection Figure A5a-c shows the feature importance for the three different feature dimensions, i.e. spectral (a), feature importance, almost no group is entirely eliminated. However, more important groups are selected at a higher rate ( Figure A5d-f), e.g. SAR bands, temporal snapshot statistics or erosion and dilation metrics. The full list of pre-selected features with feature importance ranking is presented in Table A1. Figure A5. Feature pre-selection using Random Forest Feature Importance. Top: Boxplots of Feature Importance for the spectral (a), temporal (b) and spatial (c) domains, represented through bands/indices, aggregation statistics, and texture metrics, respectively; the y-axis is drawn logarithmic, thus outliers with 0% Feature Importance are omitted from this plot. Bottom: Number of selected features for the spectral (d), temporal (e) and spatial (f) domains. See Table A1 for a list with all pre-selected features.

b) Model selection
The performance indicators vary considerably from model to model ( Figure  The 50 randomly selected features of the best performing model are shown in Figure A6, grouped according to domain (see supplemental material for the full list). Several groups were entirely eliminated, while other groups are selected at higher or lower rates than present in the pre-selection. Among high-ranking groups in the preselection, NDVI was entirely eliminated and median and maximum statistics were selected at a lower rate. Lowwavelength bands (blue, green), the 2 nd red-edge band, kurtosis and closing were selected at a higher rate as compared to the pre-selection. The full list of selected features is presented in

3) Discussion
Along all feature dimensions, i.e. spectral, temporal and spatial domains, distinct features were selected at a higher rate. In the spectral dimension, both radar polarizations appear very relevant. This is likely due to the fairly direct relationship between backscatter and vertical structure (Li et al. 2020) There are also important optical bands and indices. Low-wavelength bands were likely selected as they capture brightness gradients, and vegetation-sensitive bands and indices were selected. These bands likely provide explanatory power to the machine learning model with regards to typical roof materials and vegetation compositions of particular settlement types. Short-wave infrared bands were not selected at all, which probability is due to its sensitivity to vegetation and water content, which is already covered by the near-infrared bands or not relevant for predicting building height, respectively.
Within the temporal dimension, temporal variability and data distribution statistics were selected at a lower rate, whereas most quantiles and the average are very important. The quantiles represent different temporal snapshots, and their combination implicitly include the temporal variability while also providing spectral information. Variability and distribution statistics were likely not selected as they provide redundant information when several quantiles are available. Similarly, the median was hardly selected, whereas the average was very dominant, which is likely due to redundancy.
In the spatial dimension, two texture metrics were hardly selected (blackhat, tophat), while three metrics appear very important (erosion, opening, closing). The tophat and blackhat metrics are operations that enhance bright and dark elements in a dark and bright background, respectively. When applied to highly structured urban landscapes, the resulting images appear sharpened, which is probably an undesired property as most information content is rather encoded in the spatial neighborhood (e.g. shadow effects). The erosion (dilation) operation assigns the minimum value in the structuring element to the central pixel, i.e. in a city, it selects the darkest pixel / shadow (brightest pixel / roof). The opening operator performs a dilation on the eroded image, i.e. in a city, it selects the brightest of the selected shadows, whereas the closing operator preforms an erosion on the dilated image, i.e. in a city, it selects the darkest of the selected roofs. By combining those three texture metrics with each other, one can assume that some logical height derivation is possible, especially if complementary spectral and temporal information is added.