Large-scale Building Height Retrieval from Single SAR Imagery based on Bounding Box Regression Networks

Building height retrieval from synthetic aperture radar (SAR) imagery is of great importance for urban applications, yet highly challenging owing to the complexity of SAR data. This paper addresses the issue of building height retrieval in large-scale urban areas from a single TerraSAR-X spotlight or stripmap image. Based on the radar viewing geometry, we propose that this problem can be formulated as a bounding box regression problem and therefore allows for integrating height data from multiple data sources in generating ground truth on a larger scale. We introduce building footprints from geographic information system (GIS) data as complementary information and propose a bounding box regression network that exploits the location relationship between a building's footprint and its bounding box, allowing for fast computation. This is important for large-scale applications. The method is validated on four urban data sets using TerraSAR-X images in both high-resolution spotlight and stripmap modes. Experimental results show that the proposed network can reduce the computation cost significantly while keeping the height accuracy of individual buildings compared to a Faster R-CNN based method. Moreover, we investigate the impact of inaccurate GIS data on our proposed network, and this study shows that the bounding box regression network is robust against positioning errors in GIS data. The proposed method has great potential to be applied to regional or even global scales.


Introduction
Three-dimensional (3-D) building models are widely used in public and commercial sectors for environmental research and location-based services. For the past three decades, 3-D building reconstruction has been a hot topic in remote sensing [1]; however, there is limited information on the third dimension, i.e., building height, on a regional or global scale. Studies on building height retrieval primarily employ high-resolution optical images and airborne LiDAR data [2]. Optical data acquisition requires cloud-free weather, and airborne or terrestrial data are too expensive to collect globally.
Synthetic Aperture Radar (SAR) imagery, on the other hand, is capable of providing data regardless of time of day or weather conditions. Such data are of great interest to disaster response applications [3,4] and to studies concerning regions frequently covered by clouds [5]. Since the launch of TerraSAR-X in 2007, modern SAR satellites, e.g., TerraSAR-X, TanDEM-X, and COSMO-SkyMed, have been providing meter or even sub-meter resolution images, making it possible to extract and reconstruct man-made objects from spaceborne SAR data. In addition, complete global coverage of TerraSAR-X and TanDEM-X stripmap mode data has been acquired since 2012, providing great potential as a data source for global building reconstruction [6].
The study of building analysis from SAR imagery dates back to 1969, when Laprade and Leonardo derived the elevation of a few buildings from shadows and layovers using simulated radar images [7]. Since then, various studies have been conducted on this topic [8,9,10,11,12,13]. However, building interpretation from SAR data is highly challenging. Due to the side-looking geometry and single-band radar sensors, urban structures are clearly visible in SAR images but are difficult to distinguish from each other. Several works [10,12,14] develop tailored algorithms for building analyses in complex urban environments, but these methods are difficult to apply to large-scale areas. Recently, deep neural networks have been applied to individual building segmentation in SAR images [15]. Building heights are subsequently computed based on the radar viewing geometry. However, for annotating building areas, this method requires an accurate digital elevation model (DEM), which is unavailable in most areas and thus restricts this method from being generalized to other regions.
In this work, we are interested in the height estimation of individual buildings on a large scale, using single SAR images. We develop a method that takes SAR images and building footprints as input and retrieves building heights by predicting bounding boxes of buildings (cf. Figure 1). Next, we briefly explain the challenges involved in this task and review related work.

Challenges
Because of the side-looking imaging geometry and complex backscattering mechanism, SAR image interpretation is generally a difficult task. For interpreting individual buildings in urban SAR images, the challenges are mainly two-fold:
For an isolated building in SAR images, the main challenge is to recognize its components, i.e., the roof, walls, and footprint. Figure 2 illustrates the amplitude profile of two flat-roof buildings in a slant-range SAR image. As can be seen, the wall area lw and the roof area lr in SAR images are always mixed and difficult to differentiate: lw covers lr when the building height h is large (cf. Figure 2 (a)), and it is covered by lr when h is small (cf. Figure 2 (b)). In addition, for low-rise buildings, the roof area lr partially overlaps the footprint area lf (cf. Figure 2 (b)), and therefore the near-range side of lf might be ambiguous. Moreover, the far-range side of the footprint area lf is unknown, as it adjoins the shadow area, which also appears dark in SAR images.
For multiple adjacent or nearby buildings, a more crucial issue is to identify them correctly. Since the intensity values in SAR images are closely related to material types and structural shapes of objects, consecutive buildings in the physical world are difficult to separate in a SAR image unless obvious material or structure changes exist at building boundaries. In addition, even if buildings in the real world are not neighboring, they may overlap each other in SAR images, which significantly increases the difficulty of image interpretation. For example, Figure 1 (a) shows a typical urban region in a TerraSAR-X spotlight image containing several buildings whose footprints and bounding boxes are plotted in Figure 1 (b) and (c), respectively. By looking only at the SAR image, it is hardly possible to tell the number of buildings or to distinguish connected buildings, for instance, the purple building and the green building. Besides, it is noticeable that the green building overlaps the yellow and the blue ones in the SAR image, although their footprints are not connected.

Related work
In the literature, researchers have investigated building height retrieval from a single SAR image [7,16,11,17], InSAR data [18,19], and multi-aspect SAR or InSAR data [20,21,22] or even circular SAR [23,24] to overcome the drawback of occlusions originating from the side-looking geometry. In addition to SAR data, auxiliary data, e.g., building outlines extracted from optical images [25,12] and building footprints obtained from GIS data [26,13,27], are introduced for providing exact locations and geometric shapes of buildings in the real world.
To retrieve building heights from SAR data, most researchers employ data-driven approaches that first extract salient features such as double bounce lines, layovers, shadows, and InSAR phases and then deduce building parameters. Some researchers first detect bright-line segments and regularly spaced point-like features and subsequently group them into building regions [9,22,28,29,30]. Alternatively, several studies directly extract building regions of layovers and shadows using segmentation algorithms, such as the mean shift algorithm and conditional random field (CRF) [31], the marker-controlled watershed algorithm [17], and thresholding and morphological operations [32]. Besides data-driven approaches, model-based methods have also been developed. Detailed modeling of the geometrical and radiometric properties of isolated buildings is performed in [16,33,10], from which building shapes in SAR data are estimated. Such techniques require extensive prior knowledge about objects, such as materials, roughness, humidity, and orientation with respect to the SAR sensor, which is generally unknown. Another model-based approach is developed in a simulating and matching fashion [34,35,14,36,37]. This approach comprises a simulation step in which a SAR image or InSAR phase is simulated using a building hypothesis and a matching step in which the simulated data are matched with real data. The process is conducted iteratively until the optimal building parameters are achieved.
Although a considerable amount of research has been carried out on building height retrieval using SAR data, few studies have investigated this problem on a large scale. Most methods target buildings with specific shapes, e.g., rectangular [38,39,40] or L-shaped footprints [41,17], flat [42] or gable roofs [43,44], and different heights [44,45,46,47]. The majority of studies investigate simple scenarios where a minimal distance between buildings is required to ensure that scattering effects of different buildings do not interfere with each other [48,11,12,49]. Moreover, the performance of the presented methods is typically reported for a small set of test data, usually comprising only one or a few buildings. The generalisability of much published research on this issue is therefore problematic.
In recent years, deep neural networks have become increasingly popular and have triggered breakthroughs in many fields, including a wide range of remote sensing applications [50,51,52,53,54,55,56,57,58]. In contrast to classical approaches that require expert domain knowledge and hand-crafted features, deep networks rely on a large amount of raw data to learn effective feature representations in an end-to-end fashion. However, the major obstacle to applying deep networks to urban SAR analysis tasks is the lack of annotation data. To address this issue, Shahzad et al. [59] introduce a SAR tomography (TomoSAR) point cloud to acquire building areas in a SAR image and take them as ground truth annotations to train a segmentation network to extract building areas. In [60], Sun et al. generate building areas in a SAR image using a DEM instead of a TomoSAR point cloud, as the latter is rare. However, neither work annotates individual buildings. Shermeyer et al. [61] present a multi-sensor all weather mapping (MSAW) dataset containing airborne SAR images, high-resolution optical images, and building footprint annotations. However, building footprints, instead of building heights, are the learning target in this work. In [15], Sun et al. annotate individual buildings in a TerraSAR-X spotlight image employing a highly accurate DEM and propose a segmentation network for predicting building areas in the SAR image. The segmentation results are then applied to reconstruct building heights. This work has segmented individual buildings from a single SAR image on a large scale for the first time. However, pixel-wise labels are expensive, and the data set generation approach requires accurate DEMs. The unavailability of accurate DEMs restricts this method from being generalized to larger areas.

Contributions
This work aims to retrieve building heights using a single SAR image on a large scale. To overcome the scarcity of annotation data and improve transferability, this work proposes to generate annotation data with building heights that can be acquired from multiple sources. The task of building height estimation is formulated as a bounding box regression problem, i.e., a task to regress the center coordinate and the size of the bounding box for each building.
The main contributions of this paper are three-fold:
a. We propose a workflow for building height retrieval in single SAR images with GIS data. To the best of our knowledge, this is the first time that deep networks are employed in the problem of building height retrieval in large areas from TerraSAR-X images in both high-resolution spotlight and stripmap modes.
b. We formulate the problem of building height retrieval as a bounding box regression problem and propose a bounding box regression network that is very efficient owing to the tailored use of building footprints. The fast computation speed is significant for large-scale applications.
c. We propose a ground truth generation approach to produce building bounding boxes. This approach is able to integrate multiple sources of building heights, thus providing large potential in analyzing complex urban regions.
The remainder of this paper proceeds as follows. Section 2 formulates the problem and delineates the proposed method. Section 3 is concerned with the dataset generation approach to tackle the problem of dataset scarcity. The experiments and results are presented and analyzed in Section 4. In Section 5, we discuss several practical problems related to applying our method to large-scale building height retrieval. Finally, Section 6 concludes this paper.

Problem formulation
We consider LoD1 building models, i.e., prismatic models with flat roof structures [62]. Due to the radar viewing geometry, scatterers on a vertical line in the geographic coordinate system always have the same azimuth coordinate in a SAR image, i.e., this vertical line appears parallel to the range direction in the SAR image. Therefore, each vertical building wall in a SAR image has one pair of opposite sides parallel to the range direction. This can be observed in Figure 1. Hence, the extent of a building in a SAR image is bounded by two vertical lines from building walls in the azimuth direction and the region of the layover and footprint in the range direction. In this work, we exploit this geometric relationship to retrieve building heights.
Figure 3 illustrates this geometric relationship by two buildings in the Universal Transverse Mercator (UTM) and the SAR image coordinate systems. On the left of the figure, b1 and b2 are two buildings in the UTM coordinate system and are imaged on a SAR image plane. As can be seen, sensor-visible walls (yellow and blue) are projected into the SAR image as parallelogram shapes, and vertical sides of the walls are parallel to the slant-range direction. The building height h is directly related to the layover length L in the slant-range direction:

$$h = \frac{L}{\cos\theta}, \tag{1}$$

where θ is the incidence angle.
On the right of Figure 3, b1 and b2 and their bounding boxes (green) are shown in the SAR image coordinate system. As can be seen, for both b1 and b2, the layover length L is the width difference between the building bounding box and the footprint bounding box:

$$L = L_b - L_f, \tag{2}$$

where $L_b$ and $L_f$ denote the widths, i.e., the extents in the range direction, of the building bounding box and the footprint bounding box, respectively. Therefore, for a building in a SAR image, its height can be obtained once its footprint is known and its bounding box is detected. Based on these geometric relationships, we formulate the problem of building height retrieval from SAR images as a bounding box regression problem. That is, given a SAR image and a building's footprint, find the bounding box of the building, and then derive the building height from it.
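The relationship between box widths and building height can be illustrated with a minimal sketch. It assumes slant-range geometry, in which a vertical wall of height h produces a layover of length L = h cos θ, and it assumes box widths are measured in pixels along the range direction. The function name and parameters are hypothetical and not taken from the paper.

```python
import math


def height_from_boxes(building_width_px, footprint_width_px,
                      range_spacing_m, incidence_angle_deg):
    """Estimate a building height in meters from two box widths in range.

    Assumption: slant-range geometry, i.e. layover length L = h * cos(theta).
    """
    # The layover length in pixels is the width difference of the two boxes.
    # Negative differences are clamped to zero (no measurable layover).
    layover_px = max(building_width_px - footprint_width_px, 0.0)
    # Convert pixels to meters with the range pixel spacing.
    layover_m = layover_px * range_spacing_m
    return layover_m / math.cos(math.radians(incidence_angle_deg))
```

For instance, a width difference of 20 pixels at 1 m range spacing and a 60° incidence angle would correspond to a height of roughly 40 m under these assumptions.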

Footprint-guided bounding box regression
We propose a footprint-guided bounding box regression network for building height retrieval that exploits the location relationship between a building's footprint and its bounding box. Figure 4 provides an overview of the proposed workflow. Specifically, in the network structure, we concatenate a SAR image and a building footprint mask as the input of the network. ResNet-101 [63] is employed as the backbone. ResNet-101 has in total 101 weighted layers, including 5 blocks of convolutional layers, i.e., conv1, conv2, conv3, conv4, conv5, and each contains a multi-layer deep subnetwork. First, conv1 to conv4 in ResNet are utilized to extract feature maps. We extract the footprint bounding box from the building footprint mask and map it to the feature maps as the region of interest (RoI) of the building, i.e., the initial bounding box to be corrected. For each RoI, local features are pooled by RoI-Align [64]. Then, conv5 of ResNet takes the pooled features, and a global average pooling layer and a fully connected layer proceed to predict corrections for the RoI with respect to the ground truth bounding box. The corrections are then added to the RoI of each building to produce its bounding box. Finally, building heights are derived from the predicted bounding boxes and are used to extrude LoD1 building models from the building footprint polygons.
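As an illustration of the step that extracts the footprint bounding box from the footprint mask, the following sketch computes an axis-aligned box around the nonzero pixels of a binary mask. It assumes the mask is a list of rows with columns along the range direction and rows along the azimuth direction, and that pixel k covers the interval [k, k + 1). Both conventions and the function name are assumptions for illustration only.

```python
def footprint_bbox(mask):
    """Return (x_center, y_center, width, height) in pixels of the nonzero
    region of a 2-D binary mask, or None if the mask is empty.

    Assumption: mask is a list of rows; x runs along columns, y along rows.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    r0, r1 = min(rows), max(rows)
    c0, c1 = min(cols), max(cols)
    width = c1 - c0 + 1    # number of columns spanned by the footprint
    height = r1 - r0 + 1   # number of rows spanned by the footprint
    return (c0 + width / 2.0, r0 + height / 2.0, width, height)
```

The resulting box would serve as the initial RoI; mapping it onto the feature maps additionally requires dividing by the stride of the backbone stage, which is not shown here.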
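For the parameterization of bounding boxes, we adopt the (x, y, w, h) coordinates used by R-CNN [65]. Let $B = [x_B, y_B, w_B, h_B] \in \mathbb{R}^4$ be the bounding box representation as a 4-dimensional vector, where x, y, w, and h denote the box's center coordinates and its width and height in an image patch. The task of bounding box regression is to regress a candidate bounding box B into a target bounding box $G = [x_G, y_G, w_G, h_G]$. In our case, B is the footprint bounding box, and G is the building bounding box. The network predicts the distance vector $\Delta = [\delta_x, \delta_y, \delta_w, \delta_h]$:

$$\delta_x = \frac{x_G - x_B}{w_B}, \quad \delta_y = \frac{y_G - y_B}{h_B}, \quad \delta_w = \log\frac{w_G}{w_B}, \quad \delta_h = \log\frac{h_G}{h_B}. \tag{3}$$

We employ the complete intersection over union (CIoU) loss [66], which considers three geometric factors of bounding boxes: the overlap area, the central point distance, and the aspect ratio. The CIoU loss is defined as:

$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(b, g)}{c^2} + \alpha v, \tag{4}$$

where b and g denote the central points of B and G, ρ is the Euclidean distance, c is the diagonal length of the smallest enclosing box covering the two boxes, α is a positive trade-off parameter, and v measures the consistency of the aspect ratio. IoU, α, and v are defined as follows:

$$IoU = \frac{|B \cap G|}{|B \cup G|}, \quad v = \frac{4}{\pi^2}\left(\arctan\frac{w_G}{h_G} - \arctan\frac{w_B}{h_B}\right)^2, \quad \alpha = \frac{v}{(1 - IoU) + v}. \tag{5}$$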
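The box parameterization and the loss can be made concrete with a small, framework-free sketch. It assumes the standard R-CNN offset encoding (center offsets normalized by the candidate box size, log-ratios for width and height) and the standard CIoU definition; the function names are hypothetical, and a real training loop would use differentiable tensor operations instead of plain floats.

```python
import math


def encode_deltas(b, g):
    """R-CNN style offsets that map candidate box b onto target box g.
    Boxes are (x_center, y_center, width, height)."""
    bx, by, bw, bh = b
    gx, gy, gw, gh = g
    return ((gx - bx) / bw, (gy - by) / bh,
            math.log(gw / bw), math.log(gh / bh))


def decode_deltas(b, deltas):
    """Apply predicted offsets to candidate box b to obtain a box."""
    bx, by, bw, bh = b
    dx, dy, dw, dh = deltas
    return (bx + dx * bw, by + dy * bh, bw * math.exp(dw), bh * math.exp(dh))


def _corners(box):
    x, y, w, h = box
    return x - w / 2.0, y - h / 2.0, x + w / 2.0, y + h / 2.0


def ciou_loss(b, g):
    """CIoU loss: 1 - IoU + rho^2 / c^2 + alpha * v."""
    bx0, by0, bx1, by1 = _corners(b)
    gx0, gy0, gx1, gy1 = _corners(g)
    # Overlap area and IoU.
    inter_w = max(0.0, min(bx1, gx1) - max(bx0, gx0))
    inter_h = max(0.0, min(by1, gy1) - max(by0, gy0))
    inter = inter_w * inter_h
    union = b[2] * b[3] + g[2] * g[3] - inter
    iou = inter / union
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = ((max(bx1, gx1) - min(bx0, gx0)) ** 2
          + (max(by1, gy1) - min(by0, gy0)) ** 2)
    # Squared distance between the two center points.
    rho2 = (b[0] - g[0]) ** 2 + (b[1] - g[1]) ** 2
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(g[2] / g[3])
                                - math.atan(b[2] / b[3])) ** 2
    denom = (1.0 - iou) + v
    alpha = v / denom if denom > 0.0 else 0.0
    return 1.0 - iou + rho2 / c2 + alpha * v
```

Identical boxes give a loss of zero, while disjoint boxes give a loss above one because the center-distance term remains positive even when the IoU is zero.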

Reference Data Generation
For training our network, building bounding boxes as reference data and building footprints as input data in the SAR image coordinate system are necessary. For this reason, we develop a workflow that employs building footprint and height data to automatically label building bounding boxes and building footprints in SAR images. The proposed workflow comprises three steps that are illustrated in Figure 5: 1) building heights acquisition, 2) building footprint masks generation, and 3) building bounding boxes generation. In the following sections, we explain the details. After data set generation, our network concatenates a SAR image and a building footprint mask from the training set as input and predicts a correction for the footprint bounding box with respect to the building bounding box. Then the trained model is evaluated using the testing set. Building heights are computed from the predicted bounding boxes and building footprints, and subsequently, LoD1 building models are reconstructed.

Building height acquisition
The first step is to collect building data. For each building, we collect the building height $h$ and the ground height $h_{\mathrm{ground}}$, along with its footprint coordinates (x, y).
The proposed workflow only requires one single height value for one building, which can be acquired from various data types, such as city models, LiDAR data, and accurate DEMs. Figure 6 shows an example of three data sources of building heights of the same area in Berlin. In some cities, public data sets are available that can be utilized to generate our annotations, e.g., Berlin city models [67], NYC open data [68], and 3D Buildings and Addresses of the Netherlands (3D BAG) [69]. This loose requirement of the height data source significantly reduces the barrier of training data creation, which in turn supports the generation of reference data on a larger scale.

Generation of building footprint masks in SAR images
In the previous step, building data are acquired in the UTM coordinate system. For our task, building footprints need to be projected to the SAR image coordinate system. That is to say, for each building footprint, its coordinates $(x, y, h_{\mathrm{ground}})$ need to be transformed to (rg, az), where rg and az denote the range and azimuth coordinates, respectively. Generally, the coordinate transformation from the UTM coordinate system to the SAR imaging coordinate system involves iteratively solving the Doppler-Range-Ellipsoid equations, which can be implemented with different approaches [70,71,72,73]. In this work, radar coding was performed using DLR's Integrated Wide Area Processor (IWAP) [74]. Note that further registration is needed if the ground height $h_{\mathrm{ground}}$ is not accurate [75,76].
Then, building footprint masks are generated according to the range-azimuth coordinates of the radar-coded vertices of building footprint polygons. For each building footprint mask, we set the pixel value to 1 inside the footprint polygon and 0 elsewhere.
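The mask generation step can be sketched as a polygon rasterization. The example below uses an even-odd point-in-polygon test at pixel centers, which is one common choice and not necessarily the rule used by the authors. It assumes the vertices are already radar-coded to (rg, az) pixel coordinates, with columns along range and rows along azimuth; the function name is hypothetical.

```python
def rasterize_footprint(vertices, height, width):
    """Rasterize a polygon given as (rg, az) pixel coordinates into a
    binary mask (list of rows). A pixel is set to 1 if its center lies
    inside the polygon according to the even-odd rule."""
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        y = row + 0.5  # azimuth coordinate of the pixel center
        for col in range(width):
            x = col + 0.5  # range coordinate of the pixel center
            inside = False
            for i in range(n):
                x0, y0 = vertices[i]
                x1, y1 = vertices[(i + 1) % n]
                # Does this edge cross the horizontal line through y?
                if (y0 > y) != (y1 > y):
                    x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
                    if x < x_cross:
                        inside = not inside
            mask[row][col] = 1 if inside else 0
    return mask
```

This brute-force loop is adequate for small patches; for full scenes, a dedicated rasterization routine from an image or GIS library would be considerably faster.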

Generation of the ground truth building bounding boxes in SAR images
To generate the ground truth bounding box of a building, we first compute its footprint bounding box $B_f$. $B_f$ is defined by four values in pixels $[rg_f, az_f, L_f, w_f]$, in which $(rg_f, az_f)$ are the coordinates of the center point of the bounding box, and $L_f$ and $w_f$ are the width and height of the bounding box, respectively, as illustrated in Figure 5.
Then, the building bounding box $B_b$ is generated from $B_f$. As illustrated in Figure 3, the difference between $B_b$ and $B_f$ results from the added width L, which is the layover length corresponding to the building height h.
Finally, we remove possibly wrong bounding boxes. Since the SAR image and the height data are often collected at different times, there might be inconsistencies resulting from urban changes, such as building construction and demolition. We deal with this situation using the intensity values of the given SAR image. In the SAR image, the intensity values are generally larger in building areas than in ground areas. Therefore, the threshold is set to the mode of the intensity values of the SAR image, and bounding boxes whose mean intensity values are smaller than the threshold are excluded.
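The two steps above, extending the footprint box by the layover length and filtering boxes by intensity, can be sketched as follows. Several points are assumptions rather than details given in the text: slant-range geometry with L = h cos θ, a layover that extends toward smaller range indices (near range), integer-quantized intensities so that a mode is meaningful, and all function and parameter names.

```python
import math
from statistics import mean, mode


def building_bbox(footprint_box, height_m, incidence_angle_deg,
                  range_spacing_m, near_range_is_smaller=True):
    """Extend a footprint box (rg_f, az_f, L_f, w_f) by the layover length.

    Assumption: layover length in meters is h * cos(theta) (slant range).
    """
    rg_f, az_f, l_f, w_f = footprint_box
    layover_px = (height_m * math.cos(math.radians(incidence_angle_deg))
                  / range_spacing_m)
    # The box grows on the near-range side only, so its center moves by
    # half the layover length in that direction.
    shift = -layover_px / 2.0 if near_range_is_smaller else layover_px / 2.0
    return (rg_f + shift, az_f, l_f + layover_px, w_f)


def intensity_mode(image):
    """Mode of all pixel values of an image given as a list of rows."""
    return mode([v for row in image for v in row])


def is_plausible(image, box, threshold):
    """Keep a box only if its mean intensity reaches the threshold."""
    rg, az, width, height = box
    c0 = max(0, math.floor(rg - width / 2.0))
    c1 = min(len(image[0]), math.ceil(rg + width / 2.0))
    r0 = max(0, math.floor(az - height / 2.0))
    r1 = min(len(image), math.ceil(az + height / 2.0))
    values = [v for row in image[r0:r1] for v in row[c0:c1]]
    return bool(values) and mean(values) >= threshold
```

In use, `intensity_mode` would be computed once per SAR image and passed as `threshold` to `is_plausible` for every candidate box.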

Data description
The performance of the proposed method is evaluated on four data sets, including one TerraSAR-X high-resolution spotlight (HS) image acquired over Berlin and three TerraSAR-X stripmap (SM) images acquired over Berlin, Rotterdam, and south Brooklyn in New York. Our four data sets are termed Berlin HS, Berlin SM, Rotterdam, and New York. Figure 7 (a) shows the study region in Berlin, and the SAR images in Berlin HS and Berlin SM data sets are both cropped to cover the same region. Table 1 lists the main characteristics of the used SAR data, data sources of building footprints, and data sources of building heights in each data set. In this work, we make use of height data from LoD1 building models and accurate DEMs. LoD1 models represent buildings as blocks with flat roof structures and contain one height for each building [62]. As for DEMs, we regard the average roof height as the building height. By using the workflow described in Section 3, building bounding boxes and footprint masks are generated. For each building, our data set contains a SAR image patch, a footprint mask, and a bounding box of the building.
Table 1: Characteristics of the used SAR data, data sources of building footprints, and data sources of heights in each data set.

Columns: data set; TerraSAR-X imaging mode; pixel spacing in rg direction (m); pixel spacing in az direction (m); incidence angle; data source of building footprints; data source of building heights.

Training details
To train an effective and robust network, we first crop the SAR image into patches. Patches containing incomplete footprints or bounding boxes are discarded. In the four data sets, the high-resolution spotlight SAR images are cropped into patches of 256 × 256 pixels with a stride of 150 pixels, and the stripmap SAR images are cropped into patches of 128 × 128 pixels with a stride of 70 pixels. Consequently, building data in the study areas are prepared, and each building has a ground truth bounding box and two patches: a SAR image patch and a footprint mask patch. All building samples are then divided to build the training set and the testing set. We make sure that the training and test regions do not overlap. The network takes one SAR patch and the corresponding GIS patch for one building as inputs. Table 4 lists the patch size and sample numbers of training/testing sets of each data set. Before being fed into the models, the data are normalized to the range [0, 1]. For data sets generated using stripmap SAR images, image patches are re-scaled to 256 × 256 pixels.
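The patch preparation can be sketched with three small helpers: one that lists patch offsets along an axis for a given patch size and stride, one that checks whether a box lies completely inside a patch, and one that rescales values to [0, 1]. Min-max scaling is an assumption here, since the text only states the target range, and all names are hypothetical.

```python
def patch_origins(image_size, patch_size, stride):
    """Top-left offsets of patches along one image axis."""
    if image_size < patch_size:
        return []
    return list(range(0, image_size - patch_size + 1, stride))


def box_is_complete(box, x0, y0, patch_size):
    """True if box (x_center, y_center, width, height), given in image
    pixels, lies entirely inside the square patch starting at (x0, y0)."""
    x, y, w, h = box
    return (x - w / 2.0 >= x0 and x + w / 2.0 <= x0 + patch_size and
            y - h / 2.0 >= y0 and y + h / 2.0 <= y0 + patch_size)


def normalize(values):
    """Min-max scale a flat list of values to [0, 1] (an assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

With a patch size of 256 and a stride of 150, an axis of 556 pixels would yield the offsets 0, 150, and 300; patches for which `box_is_complete` is false for the footprint box or the building box would be discarded.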
The network is implemented in PyTorch and trained on one NVIDIA Tesla P100 16GB GPU. During the training procedure, the layers of ResNet are initialized with weights pre-trained on ImageNet [78], and other layers are randomly initialized by drawing weights from a zero-mean Gaussian distribution with a standard deviation of 0.01. All weights are updated through back-propagation, and we use stochastic gradient descent (SGD) [79] as the optimizer. The learning rate is initialized as 0.001 and reduced by a factor of 0.1 once the loss stops decreasing for three epochs. We use a momentum of 0.9 and a weight decay of 0.0005. For the hyperparameter settings of batch size and training epochs, we ran tests on the Berlin HS data set; the numerical results are shown in Table 2 and Table 3. The network is trained for 10 epochs, and we utilize a small batch size of 4.
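The learning rate rule described above can be illustrated without any deep learning framework. The sketch below is one reading of "reduced by a factor of 0.1 once the loss stops decreasing for three epochs": the rate is multiplied by the factor after three consecutive epochs without a new best loss. The class name and this exact interpretation are assumptions; in PyTorch, `torch.optim.lr_scheduler.ReduceLROnPlateau` offers comparable behavior, although its patience semantics may differ in detail.

```python
class PlateauLR:
    """Reduce the learning rate when the loss has not improved for
    `patience` consecutive epochs (an assumed reading of the schedule)."""

    def __init__(self, lr=0.001, factor=0.1, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record the loss of one epoch and return the updated rate."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Calling `step` once per epoch with the epoch loss keeps the rate at 0.001 while the loss keeps improving and drops it to 0.0001 after the third stagnant epoch.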

Comparative experiments
For our problem, the major focus is to predict the bounding box of each building correctly. As bounding box regression is also an important task for object detection, object detection networks can be employed for our problem by deriving building heights from the predicted bounding boxes.
In the experiments, we utilize five object detection models to estimate building heights and compare their results with ours. The object detection networks include three one-stage networks, SSD [80], YOLOv3 [81], RetinaNet [82], and a two-stage network, Faster R-CNN [64]. Additionally, a feature pyramid network (FPN) [83] is combined with Faster R-CNN in its backbone, termed Faster R-CNN w. FPN, for better detecting objects at different scales. We denote the combined procedures of object detection and height estimation as SSD_h, YOLOv3_h, RetinaNet_h, Faster R-CNN_h, and Faster R-CNN w. FPN_h.
For implementation, MMDetection [84] is employed for SSD_h, YOLOv3_h, RetinaNet_h, and Faster R-CNN w. FPN_h, and the implementation in [85] is utilized for Faster R-CNN_h. ResNet-101 is used as the backbone for RetinaNet_h, Faster R-CNN_h, and Faster R-CNN w. FPN_h. For all the networks, the input is the concatenated SAR image and the building footprint mask. All input image patches are re-scaled to 256 × 256 pixels. Other default parameters in each implementation are kept.

Quantitative evaluation
The performance of the networks is evaluated based on two criteria: the height accuracy and the training time. We record the time that each model takes for training on each data set and calculate building heights from the predicted bounding boxes, as stated in Section 2. The accuracy of retrieved building heights is measured by the mean ($he_{mean}$) and standard deviation ($he_{std}$) of the height errors of all buildings, $H_e$:

$$he_{mean} = \mathrm{mean}(H_e), \quad he_{std} = \mathrm{std}(H_e), \tag{7}$$

$$H_e = \{ h^i_{true} - h^i_{predict} \mid i = 1, \dots, n \},$$

where $h^i_{true}$ and $h^i_{predict}$ are the ground truth height and predicted height for the i-th building, respectively, and n is the number of test samples.
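The two accuracy measures can be computed with the Python standard library as sketched below. The errors are taken as signed differences between true and predicted heights, following the definition of the error set in the text; the use of the population standard deviation is an assumption, as the text does not say which estimator is used, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def height_error_stats(true_heights, predicted_heights):
    """Return (mean, standard deviation) of the per-building height errors.

    Errors are signed differences h_true - h_predict; the population
    standard deviation is an assumed choice.
    """
    if len(true_heights) != len(predicted_heights):
        raise ValueError("both lists must have the same length")
    errors = [t - p for t, p in zip(true_heights, predicted_heights)]
    return mean(errors), pstdev(errors)
```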
Table 5 reports numerical results of different models on four data sets, and Figure 10 shows histograms of height errors predicted by our network. We observe that Faster R-CNN_h performs the best on all four data sets among all the networks. However, the results are only 0.1-0.2 m better than those achieved by our network, which is negligible compared with the absolute height errors (in the range of 4.3 m to 5.6 m). The results show that our network, RetinaNet_h, and Faster R-CNN_h outperform SSD_h and YOLOv3_h in height accuracy. Interestingly, FPN did not bring improvement to Faster R-CNN_h. One reason could be that the difference in the scale of building footprints is not particularly large.
On the Berlin HS data set, all networks achieve the best performance in terms of height accuracy compared with the other data sets, owing to the higher spatial resolution of the spotlight image compared with the stripmap images. However, we notice that the differences are not significant. For instance, using stripmap images, the mean height error achieved by Faster R-CNN_h ranges from 4.7 m to 5.6 m, and the standard deviation from 7.1 m to 7.6 m, depending on the data set. Using spotlight data (Berlin HS), the mean height error achieved is 4.3 m, and the standard deviation is 6.4 m, which is at most 1.3 m better than the values achieved using stripmap images.
In terms of speed, our method significantly outperforms not only two-stage networks such as Faster R-CNN_h but also fast networks like SSD_h and YOLOv3_h. Compared with Faster R-CNN_h, the training time of our network is reduced by about 80%. The computation of the network is reduced owing to the utilization of footprint bounding boxes. The fast training speed is particularly important when working with large data sets.
Our network outperforms the detection-based networks mainly due to the tailored use of building footprints, i.e., the module is designed to extract the footprint bounding box as the initial bounding box specifically for our task. The detection networks, on the other hand, lack a module dedicated to extracting building footprint information. They rely on a large number of region proposals to obtain possible initial bounding boxes. In addition, our network provides one initial proposal, i.e., the footprint bounding box, for each bounding box. However, the detection networks must provide multiple proposals in the earlier stage and rely on the classification scores to select the final bounding box in the later stage. Therefore, the computational cost of our network is much smaller.
To sum up, our network achieves accuracy comparable with Faster R-CNN_h and much superior performance on speed by effectively using the multi-modal information contained in GIS data. The comparison of these results corroborates that the proposed network can significantly reduce the computational cost while keeping the height accuracy.

Qualitative evaluation
In addition to the quantitative evaluation, we visualize several predicted bounding boxes in Figure 11 and Figure 12. In both figures, the first two rows show building footprint masks and SAR image patches, and Rows 3 to 7 present predicted bounding boxes from each model, in which the corresponding building footprint masks and SAR images are both plotted. The ground truth boxes and predicted boxes are plotted in green and red, respectively.
Figure 11 presents results of models on the Berlin HS and Berlin SM data sets. We can observe a general improvement in quality from one-stage models to two-stage models and our network, especially for buildings in columns b2 and b6. All models can offer satisfactory results for buildings with larger footprints and clear signatures in the SAR image (e.g., the building in column b4). In contrast, for buildings with small footprints (see column b1) or ambiguous signatures (see column b6), one-stage models are not able to recognize full buildings. Besides, despite the resolution difference between the spotlight image and the stripmap image, the performance of all networks seems consistent.
Figure 12 visualizes results of models on the Rotterdam and New York data sets. Similar results can be seen in columns b7 and b12: all networks perform well when building signatures are clearly distinguishable from their surroundings. On the contrary, the predictions for the building in column b8 are not satisfactory. The same can be observed in column b11 for a building with a complex shape. Moreover, the examples in columns b9 and b10 show that two buildings are both well detected by all the networks despite the distinct differences in their footprints' sizes, which also indicates that FPN may not enhance the precision of predicted bounding boxes. In summary, the proposed network has a performance similar to that of Faster R-CNN_h. Reconstructed LoD1 building models and height prediction maps of our network for the whole areas of the four data sets are plotted in Figures 15 to 18.

Can our network work with inaccurate GIS data?
So far, we have employed highly accurate building footprints in our experiments as they are acquired from official data sets. However, openly available building footprints often contain positioning errors. To test the performance of the proposed network in such cases, we conduct supplementary experiments on training our proposed network with inaccurate building footprints and discuss the impact of positioning errors in GIS data.
According to a quality assessment study of OpenStreetMap (OSM) in [86], the average offset of building footprints is 4.13 m with a standard deviation of 1.71 m.We generate inaccurate footprints by injecting positioning errors to building footprints Ftp, resulting in inaccurate footprint data termed as Ftp-E.We choose Berlin HS data set for this experiment and term the data set containing positioning error in building footprints as Berlin HS-E.
The procedure is illustrated in Figure 13. $\vec{e}$ is the added positioning error, and $\alpha$ is the angle between $\vec{e}$ and the range direction. We consider the positioning error as a random variable whose magnitude is Gaussian distributed, i.e., $|\vec{e}| \sim \mathcal{N}(\mu = 4.13, \sigma^2 = 1.71^2)$. Since the offset may point in different directions, we assume the direction of $\vec{e}$ is uniformly distributed, i.e., $\alpha$ is uniformly distributed in the range $[0^\circ, 360^\circ)$. For simplicity, let $\alpha$ be discrete: $\alpha \sim \mathrm{DiscreteUniform}(0^\circ, 359^\circ)$. Note that this is the most difficult case, in which all footprints contain positioning errors. Then, we train our network on the Berlin HS-E data set and test the trained network with a clean test set. The parameter settings of the network remain the same as in the previous experiments, as described in Section 4.2.
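As a rough illustration of the sampling described above, the following Python sketch shifts one footprint by a random offset. It is not the authors' implementation: the function name, the (range, azimuth) coordinate layout in metres, and the use of NumPy are assumptions made for illustration. A Gaussian magnitude can occasionally be negative, which merely flips the direction of the offset, so the sketch does not clip it.

```python
import numpy as np


def inject_positioning_error(footprint, rng, mean_m=4.13, std_m=1.71):
    """Shift a footprint by a random offset e (hypothetical helper).

    footprint: (N, 2) array of [range, azimuth] vertex coordinates in metres
               (assumed layout; converting to pixels is left out).
    Returns the shifted footprint and the applied offset vector.
    """
    # Magnitude |e| drawn from N(mu = 4.13, sigma^2 = 1.71^2).
    magnitude = rng.normal(mean_m, std_m)
    # Angle alpha between e and the range direction, discrete uniform on 0..359 degrees.
    alpha = np.deg2rad(rng.integers(0, 360))
    # Decompose e into its range and azimuth components.
    offset = magnitude * np.array([np.cos(alpha), np.sin(alpha)])
    # Every vertex receives the same offset, so the footprint shape is preserved.
    return footprint + offset, offset


rng = np.random.default_rng(seed=0)
ftp = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 5.0], [0.0, 5.0]])
ftp_e, e = inject_positioning_error(ftp, rng)
```

Applying the helper to every footprint in the training set corresponds to the worst case discussed in the text, in which no footprint is left at its true position.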
The results are listed in Table 6. Compared to the results obtained from Berlin HS, the mean height error increases by 0.3 m, and the standard deviation of the height error increases by 0.4 m. However, the network still gives competent height estimates. For visual comparison, Figure 14 shows the results of our network trained with Berlin HS and Berlin HS-E. For buildings b and c, the network trained with Berlin HS performs better. The predictions for buildings a and e are visually very similar. We observe that predictions from the network trained on Berlin HS-E are visually satisfactory for most buildings.
The experiments show that the proposed network is robust against positioning errors in building footprint data. This finding suggests that a large amount of existing open-source GIS data, such as OSM, can be exploited for height estimation of individual buildings in SAR images.

Influences of the nonlocal filtering procedure on SAR data
In this work, we have employed original SAR amplitude images in our experiments. However, previous studies in [59, 15] perform nonlocal filtering [87] on SAR images prior to training to reduce the speckle effect. To test the influence of the nonlocal filtering procedure on our networks, we conduct supplementary experiments in which we train the proposed network with nonlocal filtered SAR images. We denoise the SAR images using a nonlocal InSAR algorithm [87]. The Berlin HS data set is chosen for this experiment, and the nonlocal filtered data set is termed Berlin HS-NL. Then, we train and test our network and all the comparative networks on the Berlin HS-NL data set. The parameter settings of the networks remain the same as in the previous experiments, as described in Section 4.2 and Section 4.3.
Table 7 lists the results. The results from the Berlin HS and Berlin HS-NL data sets are very similar for all networks. The experiments show that the filtering procedure does not improve the results. We think the reason might be that the large number of filters in CNNs, in fact, have a filtering effect on speckle noise. This finding suggests that the filtering step is not needed for our task. Therefore, the computational cost of pre-processing can be largely reduced, which is especially beneficial for larger-scale processing.

Pros and cons of retrieving building height using bounding box regression networks
We have applied the proposed network to four data sets and retrieved building heights. On the Berlin HS data set, the mean height error is 4.3 m. In CG-net [15], the corresponding error achieved using a segmentation network on the same SAR data is 2.39 m. The advantage of CG-net in terms of height accuracy is obvious. However, as mentioned in Section 1, pixel-wise labels are expensive, and it is not possible to generate training data for areas without accurate DEMs. Thus, the applicability of CG-net is restricted.
The proposed regression network has two advantages. First, since the building height retrieval problem is formulated as a bounding box regression problem, the proposed method is capable of employing building height data from multiple sources. This enables the generation of annotation data on a larger scale and improves the transferability of the proposed networks. Second, the data set generation approach for bounding boxes is much simpler than the method for generating building areas in [15]. This is crucial when processing large data sets, e.g., on a regional or even larger scale.
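To make the simplicity of the bounding box annotation concrete, here is a minimal sketch of how a building bounding box might be derived from a footprint box and a single height value, and how the height is recovered from a predicted box. It is a sketch under stated assumptions rather than the procedure of Section 3.3: it assumes a flat-roof building, slant-range geometry in which the layover of a wall of height h extends by h cos θ toward the sensor, and an image whose range index increases away from the sensor. The function names, the box layout, and the numbers in the usage lines are illustrative and not taken from the paper.

```python
import math


def building_bbox_from_footprint(fp_box, height_m, incidence_deg, rg_spacing_m):
    """Extend a footprint box toward near range by the layover length.

    fp_box: (rg_min, az_min, rg_max, az_max) in pixels (assumed layout).
    """
    # Layover extent in slant range is h * cos(theta), converted to pixels.
    layover_px = height_m * math.cos(math.radians(incidence_deg)) / rg_spacing_m
    rg_min, az_min, rg_max, az_max = fp_box
    # Only the near-range edge moves; the azimuth extent is unchanged.
    return (rg_min - layover_px, az_min, rg_max, az_max)


def height_from_bbox(fp_box, bld_box, incidence_deg, rg_spacing_m):
    """Invert the relation above: height from the near-range edge difference."""
    layover_px = fp_box[0] - bld_box[0]
    return layover_px * rg_spacing_m / math.cos(math.radians(incidence_deg))


# Illustrative values only: 10 m building, 60 degree incidence, 0.5 m range spacing.
footprint_box = (100.0, 20.0, 140.0, 60.0)
building_box = building_bbox_from_footprint(footprint_box, 10.0, 60.0, 0.5)
recovered = height_from_bbox(footprint_box, building_box, 60.0, 0.5)
```

Under these assumptions, one footprint and one height value per building are enough to produce a training label, whichever source the height comes from.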
Compared to the results of CG-net, building heights predicted using the proposed network have lower accuracy. The results are, however, still good. In [11], the authors retrieved the heights of 40 isolated buildings from high-resolution spotlight TerraSAR-X images. For three building categories based on roof shapes, that work reported mean height errors between -1 m and 3.4 m and standard deviations of height errors between 1.3 m and 5.8 m. Our experiments employed both spotlight and stripmap SAR images and conducted performance testing on large amounts of data, for instance, 10K samples in the Berlin HS data set (cf. Table 4). Considering the image resolution of the SAR data and the size of our data sets, the proposed method is very competitive.
In summary, these comparisons suggest that the proposed bounding box regression network has great potential for applications aiming at large scales, e.g., to reconstruct baseline models on a regional or even global scale. When accurate DEMs are available, segmentation networks such as CG-net [15] are preferred for higher accuracy of the reconstructed building heights.

Conclusion
This work proposes a method that retrieves building heights in large-scale urban areas from a single TerraSAR-X spotlight or stripmap image. We formulate the problem of building height retrieval as a bounding box regression problem and develop a network that takes SAR images and building footprints as input and retrieves building heights by predicting building bounding boxes. Corresponding to the method, we propose a ground truth generation approach that only requires the footprint and one height value for each building. This approach can integrate multiple sources of building heights, such as open building models, LiDAR, and DEMs. It can therefore generate annotation data on larger scales and offers great potential for analyzing complex urban regions.
Four study sites are used to test the proposed networks, including one high-resolution spotlight TerraSAR-X image of Berlin and three stripmap TerraSAR-X images of Berlin, Rotterdam, and south Brooklyn in New York City. The mean height error achieved in the four sites ranges from 4.3 m to 5.7 m. The results are significant, as they are achieved from a single stripmap or spotlight TerraSAR-X image. Compared to methods that use object detection networks for building height retrieval, the proposed network significantly reduces the computational cost while maintaining a height accuracy for individual buildings comparable to that of Faster R-CNN h. Further experiments on training the networks with inaccurate building footprint data suggest that the proposed network is robust to positioning errors in building footprints, and that a large amount of existing open-source GIS data, such as OSM, can be exploited for this task.
In the future, we are interested in improving the height accuracy of bounding box regression networks and producing LoD1 building models using SAR images in the stripmap mode on regional or even global scales. We are also interested in domain adaptation techniques to mitigate the influence of data differences between areas and increase the transferability of our models.

Figure 1: Illustration of the input and output of our method. (a) and (b) are the input data: a SAR image and building footprints in the SAR image. (c) shows the predicted bounding boxes of these buildings. Building heights are then computed from the bounding boxes and building footprints, and level-of-detail (LoD) 1 models are reconstructed, as shown in (d). rg and az denote the slant range direction and azimuth direction, respectively.

Figure 2: Illustration of the amplitude profile (ampl) of two flat-roof buildings in a slant-range SAR image. θ is the incidence angle. h is the building height. lw, lr, and lf denote the areas of wall, roof, and footprint in the slant-range SAR image, respectively. The gray shades and heights of regions a-f indicate the expected magnitude values of intensity on the SAR image. The blue arrow marks the bottom of the sensor-facing wall and the red arrow points at the double bounce line on the SAR image.

Figure 3: Illustration of bounding boxes of two buildings in a slant-range SAR image. On the left, two buildings (b1 and b2) in the UTM coordinate system are imaged in a SAR image plane. On the right, bounding boxes of b1 and b2 are shown in the SAR image coordinate system.

Figure 4: General workflow of the proposed method. After data set generation, our network concatenates a SAR image and a building footprint mask from the training set as input and predicts a correction for the footprint bounding box with respect to the building bounding box. Then the trained model is evaluated using the testing set. Building heights are computed from the predicted bounding boxes and building footprints, and subsequently, LoD1 building models are reconstructed.

Figure 5: The workflow for data set generation in three steps. First, building footprints and height data are collected in the UTM coordinate system; second, they are projected to the SAR image coordinate system to generate building footprint masks; third, building bounding boxes are generated using footprint masks and building heights.

Figure 6: Examples of three different building height sources in the same area in the city of Berlin, Germany.

Figure 7 (b) and (c) show the spotlight image and the stripmap image in the yellow rectangle in (a), respectively. The study regions in Rotterdam and New York are shown in Figure 8 and Figure 9, respectively.

(a) Study area (blue region) in Berlin: the intersection of the spotlight SAR image (black rectangle) and the DEM area (red rectangle). (b) Spotlight SAR image in the yellow rectangle in (a). (c) Stripmap SAR image in the yellow rectangle in (a).

Figure 7: The study area of both Berlin HS and Berlin SM data sets.(a) shows the area (blue) in the UTM coordinate system (UTM zone 32N).(b) and (c) show a comparison of the TerraSAR-X spotlight image and the stripmap image in the yellow rectangle in (a), respectively.

Figure 8: The Rotterdam study area in the UTM coordinate system (UTM zone 31N).

Figure 9: The New York study area in the UTM coordinate system (UTM zone 18N).

Figure 10: Histogram of building height errors predicted with our network in the study areas.

Figure 11: Examples of predicted bounding boxes using different networks in Berlin HS and Berlin SM data sets. The predicted and ground truth bounding boxes are marked in red and green, respectively. The ground truth bounding boxes are obtained using the procedure described in Section 3.3.

Figure 13: Illustration of generating building footprints with positioning errors. Positioning error $\vec{e}$ is added to building footprint Ftp, resulting in Ftp-E. rg and az denote the range direction and azimuth direction, respectively. $\alpha$ is the angle between $\vec{e}$ and rg.

Figure 15: Height prediction map in the Berlin HS data set. (top) Reconstructed LoD1 building models overlaid on the SAR image. (bottom) Height prediction map in the SAR image coordinate system. Height is color-coded.

Figure 16: Height prediction map in the Berlin SM data set. (top) Reconstructed LoD1 building models overlaid on the SAR image. (bottom) Height prediction map in the SAR image coordinate system. Height is color-coded.

Figure 17: Height prediction map in the Rotterdam data set. (top) Reconstructed LoD1 building models overlaid on the SAR image. (bottom) Height prediction map in the SAR image coordinate system. Height is color-coded.

Figure 18: Height prediction map in the New York data set. (top) Reconstructed LoD1 building models overlaid on the SAR image. (bottom) Height prediction map in the SAR image coordinate system. Height is color-coded.

Table 2: Numerical results on the Berlin HS data set for choosing the batch size. All experiments are trained for 10 epochs. The highest values of the metrics are highlighted in bold.

Table 3: Numerical results on the Berlin HS data set for choosing the number of training epochs. All experiments use a batch size of 4. The highest values of the metrics are highlighted in bold.

Table 4: Patch size and sample numbers in each data set.

Table 5: Numerical results on four data sets. The highest values of different metrics are highlighted in bold.

Table 6: Numerical results obtained from Berlin HS and Berlin HS-E data sets.

Table 7: Numerical results on four data sets. The highest values of different metrics are highlighted in bold.