Building Segmentation through a Gated Graph Convolutional Neural Network with Deep Structured Feature Embedding

Automatic building extraction from optical imagery remains a challenge due to, for example, the complexity of building shapes. Semantic segmentation is an efficient approach for this task. The latest development in deep convolutional neural networks (DCNNs) has made accurate pixel-level classification tasks possible. Yet one central issue remains: the precise delineation of boundaries. Deep architectures generally fail to produce fine-grained segmentation with accurate boundaries due to their progressive down-sampling. Hence, we introduce a generic framework to overcome the issue, integrating the graph convolutional network (GCN) and deep structured feature embedding (DSFE) into an end-to-end workflow. Furthermore, instead of using a classic graph convolutional neural network, we propose a gated graph convolutional network, which enables the refinement of weak and coarse semantic predictions to generate sharp borders and fine-grained pixel-level classification. Taking the semantic segmentation of building footprints as a practical example, we compared different feature embedding architectures and graph neural networks. Our proposed framework with the new GCN architecture outperforms state-of-the-art approaches. Although our main task in this work is building footprint extraction, the proposed method can be generally applied to other binary or multi-label segmentation tasks.


Introduction
Building footprint generation is an active topic in the remote sensing field. Recently, it has received considerable attention due to its huge potential in autonomous driving, virtual reality, urban planning, environmental, and demographic applications. Manual extraction of buildings from optical images is time-consuming and difficult in large-scale practice. In contrast, semantic segmentation is a comparatively inexpensive and time-saving technique for extracting building footprints. It aims to assign a class label to each pixel. Various semi-automatic and automatic methods [1] [2] [3] [4] have been developed to improve segmentation accuracy; traditionally, feature extraction and classification are the two main steps. The extraction of such handcrafted features usually requires strong domain-specific knowledge.
In recent years, deep learning has achieved great success in semantic segmentation. In particular, deep convolutional neural networks (DCNNs) have shown promising results, due to their high capacity for learning from data. DCNNs [5] have brought compelling advances over traditional semantic segmentation methods.
However, exploiting DCNNs for semantic segmentation tasks still raises significant challenges. The convolution layer of a DCNN is a weight-sharing architecture, and it has both shift-invariant and spatially invariant characteristics. While the invariance is clearly desirable for high-level vision tasks, it may hamper low-level tasks such as pose estimation and semantic segmentation, where precise localization is required rather than abstraction of spatial details. For instance, coarse segmentation output, such as non-sharp boundaries and blob-like shapes, is caused by convolution filters with large receptive fields and by pooling layers in the DCNN. Moreover, DCNNs fail to recover fine local details because they do not consider the interactions between pixels.
To overcome these issues, probabilistic graph models, such as the conditional random field (CRF) [7] and the Markov random field (MRF) [6], have been introduced and connected to DCNNs at the final layer. The main concept of using CRF for semantic segmentation is to transform the problem of pixel-wise classification into a problem of probabilistic inference, which assumes that similar pixels should have the same labels. This substantially improves the predictions of the pixel-wise labels and generates precise borders and exhaustive segmentation. In [7], instead of using CRF as a post-processing step, the authors propose an end-to-end architecture that combines the FCN with a fully connected CRF. However, these frameworks do not sufficiently extract the features from the images. Features at different levels have different properties for semantic segmentation. Since low-level features are rich in spatial details but lack semantic information, and high-level features are the opposite, they are naturally complementary. Another issue with CRF is that information propagation is not sufficient.
In this work, we propose a generic framework for semantic segmentation, which integrates deep structured feature embedding and the graph convolutional network. In order to extract more comprehensive and representative features, we exploit deep structured feature embedding techniques to enhance the feature fusion by incorporating multi-level characteristics. Furthermore, we propose a new graph convolutional network, the gated graph convolutional network (GGCN). GCN can aggregate the information from neighbor nodes (short range), which allows the model to learn about local structures. A recurrent neural network (RNN) with gated recurrent units (GRUs) has proven successful in modeling the long-term dependencies in sequential data. Hence, we adopt an RNN with GRUs for long-range information propagation. The proposed network integrates the two architectures, thus taking into account both local and global contextual dependencies, which is useful for semantic segmentation tasks. The resulting DSFE-GGCN is a trainable end-to-end framework. We show that joint learning of the deep structured feature embedding and GGCN parameters results in considerable performance gains.

Contributions
The contributions of this work are summarized as follows:
• A generic framework for semantic segmentation is proposed, which integrates the deep structured feature embedding and a graph convolutional neural network into an end-to-end workflow.
• We propose a novel network architecture, called a "gated graph convolutional neural network," which combines the RNN with GRUs for long distance information propagation and the GCN for short distance information propagation.
• An effective four-step preprocessing approach is proposed for data augmentation, especially for medium-resolution satellite imagery.
• The performance of different DCNNs and the proposed framework is analyzed through a systematic investigation. Our framework with GGCN surpasses the state-of-the-art approaches to building footprint extraction.

Semantic segmentation with DCNNs
The fully convolutional network (FCN) was first proposed in [8] for the task of semantic segmentation, in which convolutional layers take the place of fully connected layers. FCN makes training more efficient and allows inputs of arbitrary size at inference. A more memory-efficient approach with an alternative decoder variant, SegNet, was proposed in [9]. The indices stored during the max-pooling steps in the downsampling path are used by the decoder for upsampling. Another variant of the encoder-decoder architecture is U-Net [10]. The long skip connections in the network enable the recovery of information lost through downsampling in the encoder.
One key issue for fully convolutional neural networks is that the spatial resolution is significantly downsampled by operations such as strided convolutional layers or pooling layers. In order to overcome the poor localization property, [11] proposed another approach to improve the spatial resolution, using a probabilistic graph model, CRF, to achieve fine-grained boundaries. Instead of using CRF as a post-processing step, DeepLab-CRF [7] introduces a fully connected CRF layer, which leads to an end-to-end trainable network.

Graph model
A graph model is a probabilistic model that encodes a distribution based on a graph-based representation. The Markov random field (MRF) is one classic graph model, which uses an undirected graph to describe the joint probability distribution of random variables. It has been applied to many image processing tasks, including image coregistration, image segmentation, and image super-resolution. MRF takes into account the relationships between neighbouring pixels to infer the most probable label of each pixel.
The conditional random field (CRF) is an extension of MRF, which models the conditional probability distribution instead of the joint probability distribution. CRF as a discriminative model shows a better performance when the samples are limited. The combination of DCNNs and the graph model CRF [11,7] can produce high-resolution predictions for better segmentation.
Recent work [23] has extended DCNNs to topologies that differ from the low-dimensional grid structure. Due to significant computational drawbacks, it is impractical for real-world use. Henaff et al. [24] and Defferrard et al. [25] further improve GCN to successfully overcome this issue. Grid-like data can be interpreted as a special type of graph data, where each node lies on the grid and the number of neighbours is fixed. In this work, we propose a gated graph convolutional network, which is a trainable inference system based on GCN and an RNN with GRUs.

Building footprint extraction
Building footprint generation is currently attracting a great deal of interest, and is an active field of research in remote sensing, photogrammetry, and computer vision. The established building footprint maps are used in many important applications to analyze the process of urbanization, such as urban growth and sustainable urban development.
In [17], the authors propose a multi-stage ConvNet with an upsampling operation of bilinear interpolation. The trained model achieves a superior performance on very-high-resolution aerial imagery. Recently, an end-to-end trainable active contour model (ACM) was developed for building instance extraction [18], which learns ACM parameterizations using a DCNN. In [19], a residual refinement network was proposed to extract the building footprint using aerial images and LiDAR point clouds. In [20], the authors exploit the improved conditional Wasserstein generative adversarial network to generate the building footprint automatically. Recent work [21] has shown that most of the tasks, such as building segmentation, building height estimation, and building contour extraction, are still difficult for modern convolutional networks. In this work, we show a significant performance improvement in building footprint extraction by using our proposed novel framework.

Methodology
The details of the DSFE-GGCN framework are introduced in this section. The workflow of the proposed method is shown in Fig. 1. An image can be generalized as a graph whose nodes lie on a two-dimensional grid. Each pixel represents a node. The embedding vectors can be computed initially from node inputs, e.g., node type embeddings, and then propagated on the graph to aggregate information from the local neighborhood.
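To make the "image as a graph" view concrete, the following minimal sketch builds the adjacency matrix of a pixel grid. The 4-connected neighbourhood and the row-major node indexing are assumptions made for illustration; the text does not state which neighbourhood is used.

```python
def grid_adjacency(height, width):
    """Adjacency matrix (list of lists) of a height x width pixel grid.

    Assumptions (not from the paper): 4-connectivity, and pixel (r, c)
    is mapped to node index r * width + c.
    """
    n = height * width
    adj = [[0] * n for _ in range(n)]
    for r in range(height):
        for c in range(width):
            i = r * width + c
            # Link to the right and lower neighbours; writing both
            # adj[i][j] and adj[j][i] covers the left and upper ones.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < height and cc < width:
                    j = rr * width + cc
                    adj[i][j] = adj[j][i] = 1
    return adj


A = grid_adjacency(2, 3)
degrees = [sum(row) for row in A]  # number of neighbours per pixel
print(degrees)
```

Corner pixels have two neighbours and edge pixels three, so the number of neighbours per node is bounded by the grid structure.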

Deep structured feature embedding
Deep embedding methods typically map images into an embedding space, where their distances preserve the relative similarity. In general, the representations of the data can be learned by graph embedding techniques [22], which take into account the relationships of the data. In addition, data from different sources, such as images, point clouds, and social media data, can be transformed into feature space, which can be further used for segmentation or other tasks. In this study, the data source is only imagery. Hence, we exploit a more efficient approach for feature embedding that uses DCNNs as feature extractors.
However, the resolution of the later layers in the neural network is heavily downsampled, which is caused by strided convolution, max-pooling, or other operations. Several methods have been introduced to recover precise information from the downsampled feature maps. One common approach is to utilize interpolation techniques [9], which is both computationally cheap and memory-saving. An alternative is deconvolution, in which recorded indices of the pooling operation are used to retrieve information from the feature maps [15]. Recently, long skip connections between the contracting and expanding paths were introduced to retrieve detailed spatial information from the high-level feature layers [10]. In combination with the DenseNet block [13], FC-DenseNet was proposed in [12], where the upsampling path is composed of deconvolution, unpooling, and long skip connections. Consequently, all the feature maps from deconvolution, unpooling, or skip connections are exploited for the computation in the upsampling path of the dense blocks. Moreover, recent work shows that multiple DCNN features extracted from different networks can be complementary, and could be fused to improve segmentation accuracy. However, the method for fusing multiple features is still an open problem that needs systematic investigation.
As mentioned above, low-level features yield a better representation of localization and high-level features give more comprehensive semantics. Therefore, in this work, we concatenate features from different levels progressively in order to propagate information about localization, semantics, and other properties through graph convolutional neural networks.
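As a rough illustration of fusing features from different levels, the sketch below upsamples a coarse high-level feature map to the resolution of a low-level map and concatenates the two per pixel. Nearest-neighbour upsampling and the nested-list layout (height x width x channels) are assumptions chosen for brevity; the networks discussed in the text use learned upsampling inside the DCNN.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of an H x W x C feature map."""
    out = []
    for row in fmap:
        # Repeat every pixel `factor` times along the width ...
        wide = [pixel for pixel in row for _ in range(factor)]
        # ... and repeat the widened row `factor` times along the height.
        out.extend([list(p) for p in wide] for _ in range(factor))
    return out


def concat_features(low, high):
    """Concatenate two feature maps of equal H x W along the channel axis."""
    return [[lp + hp for lp, hp in zip(lrow, hrow)]
            for lrow, hrow in zip(low, high)]


low = [[[1.0], [2.0]], [[3.0], [4.0]]]  # 2 x 2 x 1, fine spatial detail
high = [[[0.5, -0.5]]]                  # 1 x 1 x 2, coarse semantics
fused = concat_features(low, upsample_nearest(high, 2))
print(fused[0][0], fused[1][1])
```

Each fused pixel then carries both its own low-level value and the semantic channels of the coarse cell that covers it.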

Gated graph convolutional neural network
An undirected and connected graph $G = (V, E)$ consists of a set of nodes $V$ and a set of edges $E$. The unnormalized graph Laplacian matrix $L$ is defined as:

$$L = D - A,$$

where $A$ is the adjacency matrix representing the topology of $G$, and $D$ is the degree matrix, which is calculated by

$$D_{ii} = \sum_j A_{ij}.$$

The graph Laplacian $L$ is symmetric and positive semidefinite; therefore, its eigenvalue decomposition can be expressed as:

$$L = \Phi \Lambda \Phi^T,$$

where $\Phi = (\phi_1, \phi_2, ..., \phi_n)$ are the orthonormal eigenvectors, known as the graph Fourier modes, and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, ..., \lambda_n)$ is the diagonal matrix of the non-negative eigenvalues of $L$. Assuming a signal $f$ on the graph nodes $V$, its graph Fourier transform can be formulated as $\hat{f} = \Phi^T f$. If $g$ is a filter, the convolution of $f$ and $g$ is written as

$$f * g = \Phi \hat{g} \Phi^T f,$$

where $\hat{g}$ is the spectral representation of the filter. Rather than computing the Fourier transform $\hat{g}$, the filter coefficients can be parameterized as $\hat{g} = \sum_{k=0}^{r} \alpha_k \beta_k$, as shown in [24]. With the polynomial parametrization of the filter, the spectral filter is exactly localized in space. Moreover, the learning complexity is $O(r)$, where $r$ is the support size of the filter, which is the same complexity as classical DCNNs.
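The definition of the unnormalized Laplacian and its positive semidefiniteness can be checked on a toy graph. The sketch below is illustrative only: it builds L = D − A for a three-node path graph and evaluates the quadratic form of a signal f, which equals the sum of squared differences of f across the edges and is therefore never negative.

```python
def laplacian(adj):
    """Unnormalized graph Laplacian L = D - A for an adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]  # D_ii = sum_j A_ij
    return [[(deg[i] if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def quadratic_form(lap, f):
    """Compute f^T L f, i.e. the sum of (f_i - f_j)^2 over all edges."""
    n = len(f)
    return sum(f[i] * lap[i][j] * f[j] for i in range(n) for j in range(n))


# Path graph 0 - 1 - 2 (a toy example, not data from the paper).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L = laplacian(A)
f = [1.0, 3.0, 0.0]
print(L)
print(quadratic_form(L, f))  # (1 - 3)^2 + (3 - 0)^2
```

A constant signal gives a quadratic form of zero, which corresponds to the eigenvalue 0 of L.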
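In order to avoid explicit multiplication in the spectral domain, the spectral representation $\hat{g}$ of the filter $g$ can alternatively be approximated by a Chebyshev polynomial expansion $g(\Lambda)$, which is formulated as:

$$g(\Lambda) = \sum_{k=0}^{r} \alpha_k T_k(\tilde{\Lambda}),$$

where $T_k(\tilde{\Lambda})$ is the Chebyshev polynomial of order $k$ evaluated at the rescaled eigenvalues $\tilde{\Lambda} = 2/\lambda_{max} \cdot \Lambda - I$, and $\alpha_k$ are the expansion coefficients. The graph convolution can be defined as:

$$f * g = \sum_{k=0}^{r} \alpha_k T_k(\tilde{L}) f,$$

where $\tilde{L} = 2/\lambda_{max} \cdot L - I$, and $\lambda_{max}$ is the largest eigenvalue of $L$. In [26], the authors further simplify the Chebyshev framework, setting $r = 1$ and assuming $\lambda_{max} \approx 2$, allowing them to redefine a single convolutional layer as simply:

$$H^{(l+1)} = \sigma_r\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right),$$

where $H^{(l)}$ is the hidden layer $l$. By taking into account the self-connections, the original adjacency matrix of the graph $G$ is transformed to $\tilde{A} = A + I$, where $I$ is the identity matrix. $W^{(l)}$ is the trainable weight matrix and the new degree matrix $\tilde{D}$ can be calculated by $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. The function $\sigma_r(\cdot)$ denotes a nonlinear activation function. This simplified form improves computational performance on larger graphs and predictive performance on small training sets.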
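The simplified layer of [26] can be sketched in a few lines: add self-connections, normalize the adjacency matrix symmetrically by the new degrees, multiply by the node features and the weight matrix, and apply the activation. The pure-Python matrix helpers, the choice of ReLU for the activation, and the toy graph are assumptions for illustration, not the implementation used in the experiments.

```python
import math


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj, h, w):
    """One layer: ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    n = len(adj)
    # Add self-connections: A~ = A + I.
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    # New degrees: D~_ii = sum_j A~_ij.
    d = [sum(row) for row in a_tilde]
    # Symmetric normalization, entry by entry.
    a_hat = [[a_tilde[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
             for i in range(n)]
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]  # ReLU


# Two connected nodes with one feature each (toy values).
A = [[0, 1], [1, 0]]
H = [[2.0], [4.0]]
W = [[1.0]]
print(gcn_layer(A, H, W))
```

With this normalization each node receives a weighted average of its own feature and those of its direct neighbours, which is why a single layer only propagates information over a short range.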

Propagation model
The propagation process can be formulated as:

$$a_i^t = M\left(\{h_j^{t-1} : j \in V_i\}\right), \qquad h_i^t = F\left(h_i^{t-1}, a_i^t\right),$$

where $a_i^t$ is the message layer at time step $t$, which represents the messages propagated from its neighbours $V_i$ to the node $i$, $M$ denotes the message function, and $h_i^t$ is the hidden layer of node $i$. The message layer $a_i^t$ at time step $t$ serves as input to update the hidden layer with function $F$. Our proposed method is to use GCN as the message function, which makes it easy for the propagation model to learn to propagate the node embeddings for node $i$ to all nodes reachable from $i$. We adopt gating techniques to improve on the performance of GCN, because gating allows each node to maintain its own memory while gathering the valuable information from its neighbours.
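The unrolled propagation model at time step $t$ can be written as:

$$
\begin{aligned}
a^{t+1} &= \sigma_r\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} h^t W\right),\\
z^{t+1} &= \sigma_s\left(a^{t+1} W_z + h^t U_z\right),\\
r^{t+1} &= \sigma_s\left(a^{t+1} W_r + h^t U_r\right),\\
\tilde{h}^{t+1} &= \tanh\left(a^{t+1} W_h + (r^{t+1} \odot h^t) U_h\right),\\
h^{t+1} &= (1 - z^{t+1}) \odot h^t + z^{t+1} \odot \tilde{h}^{t+1},
\end{aligned}
$$

where $r$ and $z$ are the reset and update gates, and $W_r$, $W_z$, $U_r$, $U_z$ are learnable weights for the different gates; $W_h$ and $U_h$ are the corresponding weights of the candidate state $\tilde{h}$, as in the standard GRU [29]. The function $\sigma_r$ is the ReLU function, $\sigma_s$ is the logistic sigmoid function, and $\odot$ denotes the element-wise product. The initial hidden representation of the corresponding node is taken from the feature vectors of the DSFE step. For a certain time step $t$, the messages from the neighbourhood of the node are aggregated by using GCN. After that, the hidden state of the next time step $t + 1$ is updated by gated recurrent units, which use the hidden state $h_i^t$ at time step $t$ and the message $a_i^{t+1}$ as input. With the help of the reset gate and the update gate in the GRU [29], the node can maintain its own memory and extract useful information from incoming messages. As the number of time steps increases, the model becomes capable of capturing long-range dependencies, which are difficult to model in vanilla GCN.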
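The gated update of a single node can be sketched as follows. This is a minimal illustration under stated assumptions: it follows the standard GRU form with a tanh candidate state, the weight names `w_h` and `u_h` for the candidate state are hypothetical, and the message `a` is passed in directly rather than computed by a GCN layer.

```python
import math


def vecmat(v, m):
    """Row vector times matrix."""
    return [sum(v[k] * m[k][j] for k in range(len(v)))
            for j in range(len(m[0]))]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_update(h, a, w_z, u_z, w_r, u_r, w_h, u_h):
    """Update one node state h from its aggregated message a."""
    # Update gate z and reset gate r, both in (0, 1).
    z = [sigmoid(p + q) for p, q in zip(vecmat(a, w_z), vecmat(h, u_z))]
    r = [sigmoid(p + q) for p, q in zip(vecmat(a, w_r), vecmat(h, u_r))]
    # Candidate state from the message and the reset-gated old state.
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(p + q)
            for p, q in zip(vecmat(a, w_h), vecmat(rh, u_h))]
    # Blend old state and candidate with the update gate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
h_old = [2.0, -4.0]
message = [1.0, 1.0]
print(gru_update(h_old, message, zeros, zeros, zeros, zeros, zeros, zeros))
```

With all weights at zero both gates equal 0.5 and the candidate is zero, so the node keeps exactly half of its previous state; learned weights shift this balance between memory and incoming messages.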

Prediction model
The node classification is defined as: Since we have transferred the binary semantic segmentation problem to the multi-label pixel labeling task, a softmax with negative log-likelihood loss function is used to predict the probability of each node.
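The combination of a softmax with the negative log-likelihood loss can be illustrated per node as below. The function names and the toy scores are assumptions for illustration; in practice the scores would come from the final hidden states of the network.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over the class scores of one node."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def nll_loss(scores_per_node, labels):
    """Mean negative log-likelihood of the true class over all nodes."""
    total = sum(-log_softmax(scores)[label]
                for scores, label in zip(scores_per_node, labels))
    return total / len(labels)


scores = [[2.0, 0.5, -1.0],  # node 0: class 0 is most likely
          [0.0, 0.0, 0.0]]   # node 1: uniform over three classes
labels = [0, 2]
probs = [math.exp(v) for v in log_softmax(scores[0])]
print(probs, nll_loss(scores, labels))
```

The loss decreases as the probability assigned to the true class of each node increases.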

Datasets
In this work, we use Planetscope satellite imagery [30] with three channels (R, G, B) at a 3 m spatial resolution. The imagery is acquired by Doves, which can provide complete coverage of Earth once a day. The study sites cover four cities: (1) Munich, Germany; (2) Rome, Italy; (3) Paris, France; and (4) Zurich, Switzerland. The corresponding building footprint layer is downloaded from OpenStreetMap (OSM) [31]. The images are cropped with a patch size of 64 × 64. Adjacent patches overlap by 19 pixels in one direction. In total, 48,000 sample patches are generated. The training data comprises 80% of the patches and the testing data 20%. The training and testing data are spatially separated.
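The patch cropping can be sketched as a sliding window. The stride of 64 − 19 = 45 pixels follows from the stated patch size and overlap; dropping any incomplete patch at the image border and the function name are assumptions made for illustration.

```python
def patch_origins(length, patch=64, overlap=19):
    """Start offsets of overlapping patches along one image axis.

    Assumption: patches that would extend past the image are dropped.
    """
    stride = patch - overlap  # 45 pixels for the values in the text
    return list(range(0, length - patch + 1, stride))


# Along an axis of 154 pixels, three patches fit exactly.
print(patch_origins(154))

# 80/20 split of the 48,000 patches, in integer arithmetic.
total = 48000
n_train = total * 80 // 100
n_test = total - n_train
print(n_train, n_test)
```

Because the training and testing data are spatially separated, such a split would be made by region rather than by shuffling individual patches.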

Preprocessing
The datasets utilized in this work consist of Planetscope satellite imagery and OSM building footprints as ground truth. However, since the data sources for OSM differ from the satellite imagery, there are likely to be inconsistencies between the OSM building footprints and the satellite imagery. Therefore, we need to carry out preprocessing steps to limit the inconsistencies before the experiments, which include band normalization, coregistration, refinement, and a truncated signed distance map (TSDM) (see Fig. 2). In the following subsections, we focus mainly on the coregistration and TSDM steps.

Coregistration
One inconsistency is the misalignment between OSM building footprints and satellite imagery, which is caused by different projections and accuracy levels of the data sources. Fig. 3 (a) shows an example of an OSM building footprint overlaid with the corresponding satellite imagery. There are noticeable misalignments between the building footprint and the satellite imagery. These misalignments lead to inaccurate training samples, which need to be corrected. The coregistration process includes several steps: (1) the satellite imagery is transformed from RGB to grayscale; (2) the Gaussian gradient of the grayscale imagery is calculated; (3) the cross correlation between the gradient magnitude of the grayscale image and the building footprints is computed; (4) the pixel with the maximum cross correlation is found, and the offset in both the row and column directions is derived. Fig. 3 (b) shows the result after coregistration.
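The offset search in steps (2) to (4) can be sketched with a brute-force cross correlation over a small shift window. Several simplifications are assumptions of this sketch and not part of the described method: the input is taken to be grayscale already, central differences stand in for the Gaussian gradient, and the footprint raster is also reduced to its gradient magnitude so that the toy example has a single clear peak.

```python
import math


def gradient_magnitude(img):
    """Central-difference gradient magnitude with clamped borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gy = (img[min(r + 1, h - 1)][c] - img[max(r - 1, 0)][c]) / 2.0
            gx = (img[r][min(c + 1, w - 1)] - img[r][max(c - 1, 0)]) / 2.0
            out[r][c] = math.hypot(gx, gy)
    return out


def best_offset(image_gray, footprint, max_shift=3):
    """Return the (row, col) shift of the footprint that best fits the image."""
    g_img = gradient_magnitude(image_gray)
    g_fp = gradient_magnitude(footprint)
    h, w = len(g_img), len(g_img[0])
    best, best_score = (0, 0), -1.0
    for dr in range(-max_shift, max_shift + 1):
        for dc in range(-max_shift, max_shift + 1):
            score = 0.0
            for r in range(h):
                for c in range(w):
                    rr, cc = r - dr, c - dc  # footprint pixel under shift
                    if 0 <= rr < h and 0 <= cc < w:
                        score += g_img[r][c] * g_fp[rr][cc]
            if score > best_score:
                best, best_score = (dr, dc), score
    return best


def rectangle(h, w, r0, r1, c0, c1):
    """Binary raster with a filled rectangle (toy building)."""
    return [[1.0 if r0 <= r <= r1 and c0 <= c <= c1 else 0.0
             for c in range(w)] for r in range(h)]


footprint = rectangle(16, 16, 4, 7, 4, 8)
image = rectangle(16, 16, 6, 9, 5, 9)  # same building, displaced
print(best_offset(image, footprint))
```

The returned pair is the number of rows and columns by which the footprint layer would be moved to align it with the image.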

Truncated signed distance map
In order to incorporate both semantic information about class labels and geometric properties in the training of the network, the distances of pixels to the boundaries of buildings are extracted as output representations. In our experiment, the value of the signed distance function (SDF) is determined by the distance between the pixel and its nearest point on the boundary. Positive values imply that the pixels are within the buildings and negative values indicate that they are outside the buildings.
Then we truncate the distance at a given threshold to incorporate only the pixels closest to the border. In this case, the problem in our research becomes a multi-label segmentation task, in which the detailed signed distance map enhances the prediction result. The truncated signed distance function can be expressed as:

$$D(x) = \delta_d \cdot \min\left(\min_{x \in X}(d(x)),\, T_d\right),$$

where $\min_{x \in X}(d(x))$ denotes the Euclidean distance $d(x)$ between the pixel and its nearest point on the boundary of the building. The term $\delta_d$ is a sign function that indicates whether the pixel is inside or outside an object; $T_d$ is the truncation threshold.
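A brute-force version of the truncated signed distance map is sketched below. Two choices are assumptions made for illustration: the distance to the boundary is approximated by the distance to the nearest pixel of the opposite class, and the mapping to integer class labels (rounding and shifting by the threshold, which gives 11 labels for a threshold of 5) is a hypothetical discretization.

```python
import math


def truncated_signed_distance(mask, threshold=5):
    """Signed, truncated distance of every pixel to the building border.

    Positive inside buildings (mask == 1), negative outside.
    """
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Nearest pixel of the opposite class as a boundary proxy.
            d = min((math.hypot(r - rr, c - cc)
                     for rr in range(h) for cc in range(w)
                     if mask[rr][cc] != mask[r][c]),
                    default=float(threshold))
            sign = 1.0 if mask[r][c] else -1.0
            out[r][c] = sign * min(d, threshold)
    return out


def to_class(value, threshold=5):
    """Hypothetical mapping of a distance in [-T, T] to a label in [0, 2T]."""
    return int(round(value)) + threshold


mask = [[0] * 8 + [1] * 8]  # one image row: background, then building
tsdm = truncated_signed_distance(mask)
print(tsdm[0])
print([to_class(v) for v in tsdm[0]])
```

For real image sizes a distance transform from an image processing library would replace the quadratic brute-force search.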

Experimental setup
We use 11 classes for the truncated signed distance map, with class labels in [0, 10], and the truncation threshold is set to 5. For all networks, stochastic gradient descent (SGD) is used as the optimizer and the learning rate is set to $10^{-4}$. The negative log-likelihood loss (NLL-Loss) is adopted as the loss function. The proposed framework is implemented using PyTorch. Experiments are run on an NVIDIA Tesla P100 16 GB GPU. Several semantic segmentation methods, which include FCN-32s, SegNet, FCN-16s, U-Net, FCN-8s, ResNet-DUC, CWGAN-GP, FC-DenseNet, GCN, GraphSAGE, and GGNN, are chosen for comparison.

Numerical results
Three metrics are selected to evaluate the results in the following experiments: overall accuracy (OA), F1 score, and Intersection over Union (IoU). The experiments are carried out in the following way. First, as a baseline, we assess the capability of different deep convolutional neural networks for building footprint extraction. Then, we choose different DCNNs for deep structured feature embedding and combine them with GCN [26] to decide which DCNN is the best feature extractor for our proposed framework. Finally, we use the best feature extractor for DSFE and compare the proposed framework to different graph models.
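The three metrics can be computed from the confusion counts of a binary building mask, as in the sketch below. Treating building as the positive class and evaluating on flattened pixel lists are assumptions for illustration; the text does not specify how the scores are aggregated over patches.

```python
def binary_metrics(pred, truth):
    """Overall accuracy, F1 score, and IoU for the positive class (1)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    oa = (tp + tn) / len(truth)
    # F1 = 2PR / (P + R), written directly in terms of the counts.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return oa, f1, iou


pred = [1, 1, 0, 0, 1, 0, 0, 0]
truth = [1, 0, 0, 0, 1, 1, 0, 0]
oa, f1, iou = binary_metrics(pred, truth)
print(oa, f1, iou)
```

OA counts the large background class as well, whereas F1 and IoU ignore true negatives and are therefore more sensitive to errors on buildings.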

Baseline with different DCNNs
In this section, the performance of state-of-the-art DCNNs for building footprint generation is first investigated, which indicates the capability of each DCNN for feature extraction and precise localization. FCN-32s and FCN-16s exhibit poor performance, since the feature maps of later layers have only high-level semantics with poor localization. ResNet-DUC achieves better results than the previous two because of hybrid dilated convolution and dense upsampling convolution. However, it is limited by the lack of skip connections. Max-pooling indices are reused in SegNet during the decoding process, which reduces the number of network parameters and leads to efficient training. However, as it passes only the max-pooling indices to the decoder, some local details cannot be recovered, e.g., small buildings are neglected. FCN-8s and U-Net outperform the previous networks due to the concatenation of low-level features. Compared to the other CNN models, CWGAN-GP shows promising results for building footprint generation. The performance gain is driven by the min-max competition between the discriminator and the generator of the GAN. FC-DenseNet outperforms all other semantic segmentation neural networks in numerical accuracy and visual results. On the one hand, the DenseNet block concatenates different features learned by the convolution layers, which boosts the input diversity of subsequent layers and makes training more efficient. On the other hand, the detailed spatial information can be propagated by shortcut connections between the convolution and deconvolution paths, which enhances the recovery of fine-grained segmentation from the deconvolution path.

Proposed framework with different DSFE
In order to choose the best feature extractor for our task, three representative DCNNs have been adopted in the proposed framework with the graph convolutional network. The quantitative results are shown in Table 2.

Method                   OA      F1      IoU
DSFE(U-Net)-GCN          0.8396  0.6258  0.4544
DSFE(FCN-8s)-GCN         0.8594  0.6320  0.4611
DSFE(FC-DenseNet)-GCN    0.8640  0.6677  0.5012

From Table 2 we can see that different DCNNs exhibit different capabilities for feature embedding. It is clear that FC-DenseNet, as a feature extractor in DSFE with GCN, produces the best result. This is due to the superiority of FC-DenseNet, which extends the DenseNet architecture to a U-Net-like network for semantic segmentation. In the DenseNet block, through feature reuse, there are shorter connections between layers close to the input and those close to the output, which force the intermediate layers to learn discriminative features. Moreover, DenseNet combines features by iteratively concatenating them, which contributes to improved information and gradient propagation in the networks.
As can be seen in Fig. 5, DSFE (FC-DenseNet)-GCN gives the best result, which implies that FC-DenseNet is a powerful tool for extracting different levels of features.

Proposed framework with different graph models
In this section, we choose FC-DenseNet as the feature extractor in DSFE and combine it with different graph models. The results are summarized in Table 3.
The results show that DSFE-GGCN has the best performance for our task. The IoU increases by 6.2 percentage points compared to the best DCNN result (from 0.4628 for FC-DenseNet to 0.5251). Fig. 6 shows a visual comparison of all the networks used in section 4. We marked the key region with a yellow bounding box. The close-up figures for the key regions are shown in Fig. 7.

Additional dataset
Method                 OA      F1      IoU
FC-DenseNet [12]       0.8551  0.6328  0.4628
DSFE-CRF [7]           0.8592  0.6415  0.4757
DSFE-GCN [26]          0.8640  0.6677  0.5012
DSFE-GraphSAGE [27]    0.8719  0.6726  0.5067
DSFE-GGNN [28]         0.8787  0.6778  0.5123
DSFE-GGCN              0.8881  0.6899  0.5251

We validate our proposed method with experiments on the ISPRS 2D Semantic Labeling Contest dataset, which covers the city of Potsdam and comprises 38 tiles of aerial imagery [32]. In order to maintain consistency, images with three spectral bands (red, green, blue) are used in this experiment without a digital surface model (DSM). Each aerial image has 6000 × 6000 pixels at a spatial resolution of 5 cm. The corresponding ground truth is also provided for the evaluation of results and includes six classes: Impervious surfaces, Building, Low vegetation, Trees, Cars, and Clutter/background. For our detailed experiments, we split those 38 tiles into a training subset (tile numbers 2-10 to 6-15) and a test subset (tile numbers 7-07 to 7-13). The building class is regarded as building and the other five classes are considered non-building. We cut 16,000 patches of 256 × 256 pixels from the training subset and 3573 patches from the test subset. As mentioned in the previous section, the data augmentation step TSDM is used for the medium-resolution images, and the ISPRS ground truth is already well coregistered with the optical image. Therefore, there is no data preprocessing step for the ISPRS dataset. The optical image is fed directly into the networks.
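The reduction of the six ISPRS classes to a binary building mask can be sketched as a simple lookup. Representing the ground truth as a grid of class names is an assumption made for readability; the actual label files encode the classes differently.

```python
ISPRS_CLASSES = (
    "Impervious surfaces", "Building", "Low vegetation",
    "Trees", "Cars", "Clutter/background",
)


def to_building_mask(label_map):
    """Map a grid of ISPRS class names to 1 (building) or 0 (non-building)."""
    for row in label_map:
        for name in row:
            if name not in ISPRS_CLASSES:
                raise ValueError("unknown class: " + name)
    return [[1 if name == "Building" else 0 for name in row]
            for row in label_map]


labels = [["Building", "Trees"],
          ["Cars", "Building"]]
print(to_building_mask(labels))
```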

Experimental setup
The SGD optimizer is adopted and the initial learning rate is set to 10e-4, which is reduced by a factor of ten when the validation loss saturates. Once the learning rate is reduced below 10e-8, the training stops. The number of epochs is in the range (120, 160) for all the networks. The training batch size is 4.
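The learning-rate policy described above can be sketched as a plateau-based schedule with a stopping threshold. The saturation criterion is not specified in the text, so the `patience` parameter (number of epochs without improvement before a reduction) and the function name are assumptions of this sketch.

```python
def plateau_schedule(val_losses, initial_lr, min_lr, factor=0.1, patience=2):
    """Return the learning rate used in each epoch until training stops.

    The rate is multiplied by `factor` after `patience` epochs without
    a new best validation loss; training stops once it drops below
    `min_lr`.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0
    history = []
    for loss in val_losses:
        history.append(lr)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= patience:
            lr *= factor
            stale = 0
            if lr < min_lr:
                break  # learning rate fell below the threshold
    return history


# Toy run with a halving factor so that the values are easy to follow.
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
print(plateau_schedule(losses, initial_lr=1.0, min_lr=0.3, factor=0.5))
```

With the values from the text, `factor` would be 0.1 and the initial rate and threshold would be set as stated above.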

Experimental results
The metrics OA, F1 score, and IoU are used to evaluate the results. Fig. 8 shows a visual comparison of the results predicted by different networks on the ISPRS Potsdam dataset.
FCN-8s detects a significantly higher percentage of buildings than FCN-16s and FCN-32s, by combining predictions from not only the final layer but also coarse layers, allowing more information to be preserved. The boundaries of buildings detected by U-Net are sharper than those of SegNet or E-Net. However, unlike in the medium-resolution case, the completeness of the result obtained by SegNet or E-Net is better than for U-Net, which indicates that the spatial information propagation is more effectively undertaken by recording the pooling indices than by concatenating the low-level features when the resolution is high enough, i.e., when comprehensive spatial information exists. The finer details are captured by the proposed framework with different graph models such as CRFasRNN, GCN, and GGCN rather than by CNN-only methods, which confirms the effectiveness of the graph model in modelling the interaction among pixels and spatial information propagation. Compared to CRFasRNN and GCN, the proposed GGCN method gives a better result. A close-up view of the key region is shown in Fig. 9. It can be seen that DSFE-GGCN shows a better result with respect to both completeness and sharpness for building extraction compared to other methods.
Table 4 summarizes the results of using different deep convolutional neural networks and the proposed framework on the ISPRS dataset. As can be seen, the proposed DSFE-GGCN/DSFE-GCN framework contributes a significant improvement over the DCNNs. Moreover, compared to DSFE-GCN, DSFE-GGCN can effectively propagate information over both short and long ranges, which leads to better results.

Conclusion
In this work, we develop a novel framework for semantic segmentation that combines deep structured feature embedding and a graph convolutional network. Specifically, we propose using a gated graph convolutional network to improve the information propagation by combining an RNN with GCN. Our proposed framework outperforms the state-of-the-art methods for building footprint extraction. Although we have used building footprint extraction as the practical application, the proposed method can be generally applied to other binary or multi-label segmentation tasks, such as road extraction, settlement layer extraction, or semantic segmentation of very high resolution data in general. In addition, the proposed GGCN network can work directly with unstructured data, such as point clouds and social media text messages.

Figure 1: An illustration of the proposed DSFE-GGCN framework. The initial hidden representation of the corresponding node is taken from the feature vectors in the DSFE step. For a certain time step t, the messages from the neighbourhoods of the node are aggregated by using GCN. After that, the hidden state of the next time step t + 1 is updated by gated recurrent units, which use the hidden state $h_i^t$

Figure 2: Illustration of preprocessing step
This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement

Table 1: Comparison of different deep convolutional neural networks on the test datasets

Table 2: Quantitative comparison of different deep neural networks on the Planetscope dataset

Table 3: Comparison of different networks on the Planetscope dataset

Table 4: Comparison between different deep convolutional neural networks and the proposed framework on the ISPRS dataset