Deep multi-task learning for a geographically-regularized semantic segmentation of aerial images

https://doi.org/10.1016/j.isprsjprs.2018.06.007

Abstract

When approaching the semantic segmentation of overhead imagery in the decimeter spatial resolution range, successful strategies usually combine powerful methods to learn the visual appearance of the semantic classes (e.g. convolutional neural networks) with strategies for spatial regularization (e.g. graphical models such as conditional random fields).

In this paper, we propose a method to learn evidence in the form of semantic class likelihoods, semantic boundaries across classes and shallow-to-deep visual features, each one modeled by a multi-task convolutional neural network architecture. We combine this bottom-up information with top-down spatial regularization encoded by a conditional random field model optimizing the label space across a hierarchy of segments with constraints related to structural, spatial and data-dependent pairwise relationships between regions.

Our results show that such a strategy provides better regularization than a series of strong baselines reflecting state-of-the-art technologies. The proposed strategy offers a flexible and principled framework to include several sources of visual and structural information, while allowing for different degrees of spatial regularization accounting for priors about the expected output structures.

Introduction

This paper deals with parsing decimeter resolution overhead images into semantic classes, relating to land cover and/or land use types. We will refer to this process as semantic segmentation. For a successful segmentation, one requires visual models able to disambiguate local appearance by understanding the spatial organization of semantic classes (Gould et al., 2008). To this end, machine learning models need to exploit different levels of spatial continuity in the image space (Campbell et al., 1997, Shotton et al., 2006). Accurate land cover and land use mapping is an active research field, growing in parallel to developments in sensors and acquisition systems and in data processing algorithms. Applications ranging from environmental monitoring (Asner et al., 2005, Giménez et al., 2017) to urban studies (Zhong and Wang, 2007, Jat et al., 2008) benefit from advances in the processing and interpretation of overhead data.

Semantic segmentation of sub-decimeter aerial imagery is often tackled by Markov and conditional random fields (MRFs, CRFs) (Besag, 1974, Lafferty et al., 2001) combining local visual cues (the unary potentials) with interactions between nearby spatial units (the pairwise potentials) (Kluckner et al., 2009, Hoberg et al., 2015, Zhong and Wang, 2007, Shotton et al., 2006, Volpi and Ferrari, 2015). By maximizing the posterior joint probability of a CRF over the labeling (i.e. minimizing a Gibbs energy), one retrieves the most probable labeling of a given scene, i.e. the most probable configuration of local label assignments over the whole image space. These frameworks allow the joint modeling of bottom-up evidence, encoded in the unary potentials, and domain-specific prior information, encoded in the pairwise spatial interaction terms.
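In their standard textbook form (the specific potentials used in this paper are defined over segments of a hierarchy, see Section 3.3), such models define a Gibbs distribution over labelings:

```latex
P(\mathbf{y}\mid\mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,\exp\!\bigl(-E(\mathbf{y},\mathbf{x})\bigr),
\qquad
E(\mathbf{y},\mathbf{x}) \;=\; \sum_{i\in\mathcal{V}} \psi_u\bigl(y_i\mid\mathbf{x}\bigr)
\;+\; \lambda \sum_{(i,j)\in\mathcal{E}} \psi_p\bigl(y_i,y_j\mid\mathbf{x}\bigr),
```

where $\mathcal{V}$ indexes the spatial units (pixels or segments), $\mathcal{E}$ their pairwise neighborhood relations, and $\lambda$ trades bottom-up evidence against spatial regularization; maximizing the posterior over labelings is then equivalent to minimizing the energy $E$.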

The idea behind the proposed model is that, when dealing with urban imagery (and decimeter resolution imagery in general), both the content of the image and the classes are highly structured in the spatial domain, calling for data- and domain-specific regularization. Following this intuition, we model two key aspects of spatial dependencies: input and output space interactions. The former are usually encoded by operators accounting for the spatial autocorrelation of pixels in the image domain. The latter are encoded by different kinds of pairwise potentials, favoring specific configurations drawn from a predefined prior distribution.

  • To extract information about local input relations, we employ state-of-the-art convolutional neural networks (CNNs; LeCun et al., 1998, Simonyan and Zisserman, 2015, Krizhevsky et al., 2012) providing data-driven cues for multiple tasks: a single CNN provides not only approximate class likelihoods, but also predictions of the semantic boundaries between classes. The latter usually coincide with natural edges in the image that also correspond to changes in labeling (a minimal sketch of such a multi-task architecture follows this list). We then build a segmentation tree using the semantic boundaries predicted by the CNN. Such a tree represents a hierarchy of regions spanning from the lowest level, defined by groups of pixels (superpixels), to the highest level, the whole scene. The region partitioning depends jointly on shallow-to-deep visual features and the semantic boundaries learned by the multi-task CNN.

  • To account for the output relations between regions, we combine the information within each region of a hierarchy using a top-down graphical model that encodes key aspects of the spatial organization of labels, given the observed inputs. This second modeling step is based on a CRF that aims at reducing the complexity of the pixel-wise maps (i.e. regularizing them) by parsing semantically and spatially consistent regions of the image, likely to belong to given classes, at different scales. Specifically, the CRF model takes into account evidence from the CNN (class likelihoods, learned visual features and the presence of class-specific boundaries) and spatial interactions (label smoothness, label co-occurrence, region distances, elevation gradients) within the hierarchy. In other words, it learns the extent and the labeling of each segment simultaneously, by minimizing a specifically designed energy.
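As a concrete illustration of the first point, the following PyTorch sketch shows a minimal two-head network of the kind described above: a shared convolutional trunk (here a VGG-16 feature extractor) with one head predicting class scores and one predicting semantic-boundary scores. This is a simplified, hypothetical stand-in for the paper's architecture, which additionally exploits hypercolumn features; the class and function names are ours.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class MultiTaskSegNet(nn.Module):
    """Shared VGG-16 trunk with two 1x1-convolution heads: per-pixel
    class scores and semantic-boundary scores, both upsampled back
    to the input resolution."""
    def __init__(self, n_classes):
        super().__init__()
        self.trunk = vgg16(weights=None).features   # convolutional layers only
        self.class_head = nn.Conv2d(512, n_classes, kernel_size=1)
        self.boundary_head = nn.Conv2d(512, 1, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.trunk(x)                        # (B, 512, h/32, w/32)
        cls = F.interpolate(self.class_head(feats), size=(h, w),
                            mode='bilinear', align_corners=False)
        bnd = F.interpolate(self.boundary_head(feats), size=(h, w),
                            mode='bilinear', align_corners=False)
        return cls, bnd

def multitask_loss(cls_logits, bnd_logits, labels, boundaries, w=1.0):
    # joint objective: cross-entropy for classes, weighted BCE for boundaries
    return (F.cross_entropy(cls_logits, labels)
            + w * F.binary_cross_entropy_with_logits(bnd_logits, boundaries))
```

In practice the trunk would be initialized with pretrained weights and the loss weight balanced on validation data.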

A visual summary of the proposed pipeline is presented in Fig. 1.

We evaluate all the components of the system and show that spatial regularization is indeed useful in simplifying class structures spatially, while achieving accurate results. Since spatial structures are learned and encoded directly in the output map, we believe our pipeline is a step towards systems that are still based on machine learning but do not require extensive manual post-processing (e.g. local class filtering, spatial corrections, map generalization, fusion and vectorization; Crommelinck et al., 2016, Höhle, 2017), while employing domain knowledge and data-specific regularization, tailoring the output to the application domain and softening black-box effects. Specifically, the contributions of this paper are:

  • A detailed explanation of our multi-task CNN, built on top of a pretrained network (VGG).

  • A strategy to transform semantic boundary probabilities into superpixels and hierarchical regions (see the sketch after this list).

  • A CRF encoding the desired space-scale relationships between segments.

  • The combination of different energy terms accounting for multiple input-output relationships, merging bottom-up information (outputs and features of the CNN) and top-down information (multi-modal cues about spatial arrangement) into local and pairwise relationships.
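Regarding the second contribution, the sketch below illustrates one standard way to derive superpixels from a boundary-probability map: flood a watershed from seeds placed in low-boundary areas, so that superpixel borders snap to the predicted semantic boundaries. The threshold, smoothing and function name are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def superpixels_from_boundary_prob(boundary_prob, seed_thresh=0.1, sigma=1.0):
    """boundary_prob: (H, W) array in [0, 1] predicted by the CNN.
    Returns an (H, W) integer map of superpixel ids."""
    surface = ndi.gaussian_filter(boundary_prob.astype(float), sigma=sigma)
    # seeds: connected components where the boundary probability is low
    seeds, _ = ndi.label(surface < seed_thresh)
    # flood the boundary-probability surface from the seeds; region
    # borders end up along ridges of high boundary probability
    return watershed(surface, markers=seeds)
```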

In the next section, we summarize some relevant related works. In Section 3, we present the proposed system: the multi-task CNN architecture (Section 3.1), the hierarchical representation of image regions (Section 3.2) and the CRF model (Section 3.3). We present data and experimental setup in Section 4 and the results obtained in Section 5. We finally provide a discussion about our system in Section 6, leading to conclusions presented in Section 7.

Section snippets

Mid-level representations

To generate powerful visual models, traditional methods compute local appearance models mapping local descriptors to labels over a dense grid covering the image space. The relationships between output variables are then usually modeled by MRFs and CRFs. Standard approaches to local image description involve local color statistics, texture, bags of visual words, local binary patterns, histograms of gradients and so on (Kluckner et al., 2009, Hoberg et al., 2015, Zhong and Wang, 2007,

Deep parsing of aerial images

Our model is composed of three main ingredients: a multi-task CNN providing class-likelihoods and probabilities of boundaries (Section 3.1), a segmentation tree (Section 3.2) and a CRF model encoding information about spatial dependency of the labeling (Section 3.3).
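Exact inference over such a model is intractable in general. As a rough illustration of what CRF inference over a region graph looks like, the sketch below runs Iterated Conditional Modes (ICM) with a simple Potts pairwise term; this is a deliberately simple stand-in, not the paper's energy (which adds boundary, co-occurrence, distance and elevation terms) nor its optimizer.

```python
import numpy as np

def icm_potts(unary, edges, lam=1.0, n_iter=10):
    """ICM on a region graph with a Potts pairwise term.
    unary: (n_regions, n_classes) negative log-likelihoods
    edges: list of (i, j) index pairs of adjacent regions"""
    n, c = unary.shape
    labels = unary.argmin(axis=1)          # start from the unary optimum
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(n_iter):
        changed = False
        for i in range(n):
            # local energy: unary + Potts penalty for disagreeing neighbors
            cost = unary[i].astype(float)
            for j in nbrs[i]:
                cost += lam * (np.arange(c) != labels[j])
            best = cost.argmin()
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```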

Vaihingen benchmark

The Vaihingen dataset is provided by the International Society for Photogrammetry and Remote Sensing (ISPRS), working group II/4, in the framework of the “2D semantic labeling contest” benchmark.

The dataset is composed of 33 orthorectified image tiles acquired by a near-infrared (NIR), green (G), red (R) aerial camera over the town of Vaihingen (Germany). Images are accompanied by a digital surface model (DSM)

Baselines

We compare the proposed segmentation pipeline to different baselines. The first, named Unary PX, evaluates the segmentation accuracy as given by the pixel-wise prediction of the CNN. The second, named Unary SP, reports the accuracy of the superpixel labeling: pixel-based likelihoods are averaged within each superpixel and the maximum a posteriori class of each region provides the final label (a minimal sketch follows). Note that superpixels represent the lowest level of the segmentation tree and these are produced by
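A minimal NumPy sketch of this Unary SP baseline, assuming a (H, W, C) likelihood volume and an (H, W) map of superpixel ids (shapes and names are our assumptions):

```python
import numpy as np

def unary_sp_labels(likelihoods, superpixels):
    """Average the per-pixel class likelihoods within each superpixel
    and assign each region its maximum a posteriori class."""
    H, W, C = likelihoods.shape
    flat_sp = superpixels.ravel()
    flat_lk = likelihoods.reshape(-1, C)
    n_sp = flat_sp.max() + 1
    sums = np.zeros((n_sp, C))
    np.add.at(sums, flat_sp, flat_lk)          # per-region accumulation
    counts = np.bincount(flat_sp, minlength=n_sp)[:, None]
    means = sums / np.maximum(counts, 1)
    return means.argmax(axis=1)[superpixels]   # broadcast back to (H, W)
```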

Discussion

The basic Unary PX and Unary SP baselines perform very well in terms of average accuracy (AA). The reason is that they do not tend to oversmooth predictions and, even when Unary SP does, it oversmooths only locally within the extent of the superpixels. Since the shape of the superpixels directly follows semantic boundaries in the images, average class likelihoods within regions tend to be positively correlated with the actual label, filtering out spurious noise. Oversmoothing concerns mostly spatially small classes, where CRFs tend

Conclusions and future perspectives

We proposed a model fusing bottom-up hierarchical evidence about local appearance and top-down prior information about local spatial organization and pairwise relationships between superpixels.

Regarding the first aspect, we showcased the possibility of learning the tasks of dense semantic segmentation and semantic boundary detection jointly, in an end-to-end way, using a modified pretrained network. To do so, we extended a CNN formulation for multi-task learning, relying on the hypercolumn

Acknowledgments

This work was partly supported by the Swiss National Science Foundation, grant 150593 “Multimodal machine learning for remote sensing information fusion” (http://p3.snf.ch/project-150593). The authors would also like to thank the Belgian Royal Military Academy for acquiring and providing the Zeebrugges data used in this study, ONERA (The French Aerospace Lab) for providing the corresponding ground-truth data (Lagrange et al., 2015), and the IEEE GRSS Image Analysis and Data Fusion Technical

References (47)

  • Dollar, P., Zitnick, C., 2013. Structured forests for fast edge detection. In: International Conference on Computer...
  • Felzenszwalb, P., et al., 2004. Efficient graph-based image segmentation. Int. J. Comp. Vis.
  • Gerke, M., 2015. Use of the Stair Vision Library within the ISPRS 2D Semantic Labeling Benchmark (Vaihingen).
  • Giménez, M.G., et al., 2017. Determination of grassland use intensity based on multi-temporal remote sensing data and ecological indicators. Rem. Sens. Environ.
  • Golipour, M., et al., 2016. Integrating hierarchical segmentation maps with MRF prior for classification of hyperspectral images in a Bayesian framework. IEEE Trans. Geosci. Rem. Sens.
  • Gould, S., et al., 2008. Multi-class segmentation with relative location prior. Int. J. Comp. Vis.
  • Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J., 2011. Semantic contours from inverse detectors. In:...
  • Hariharan, B., Arbeláez, P., Girshick, R., Malik, J., 2015. Hypercolumns for object segmentation and fine-grained...
  • Hedhli, I., et al., 2016. A new cascade model for the hierarchical joint classification of multitemporal and multiresolution remote sensing data. IEEE Trans. Geosci. Rem. Sens.
  • Hoberg, T., et al., 2015. Conditional random fields for multitemporal and multiscale classification of optical satellite imagery. IEEE Trans. Geosci. Rem. Sens.
  • Höhle, J., 2017. Generating topographic map data from classification results. Rem. Sens.
  • Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q., 2017. Densely connected convolutional networks. In: IEEE...
  • Kluckner, S., Mauthner, T., Roth, P.M., Bischof, H., 2009. Semantic classification in aerial imagery by integrating...