Deep multi-task learning for a geographically-regularized semantic segmentation of aerial images
Introduction
This paper deals with parsing decimeter resolution overhead images into semantic classes relating to land cover and/or land use types. We will refer to this process as semantic segmentation. A successful segmentation requires visual models able to disambiguate local appearance by understanding the spatial organization of semantic classes (Gould et al., 2008). To this end, machine learning models need to exploit different levels of spatial continuity in the image space (Campbell et al., 1997, Shotton et al., 2006). Accurate land cover and land use mapping is an active research field, growing in parallel with developments in sensors, acquisition systems and data processing algorithms. Applications ranging from environmental monitoring (Asner et al., 2005, Giménez et al., 2017) to urban studies (Zhong and Wang, 2007, Jat et al., 2008) benefit from advances in the processing and interpretation of overhead data.
Semantic segmentation of sub-decimeter aerial imagery is often tackled with Markov and conditional random fields (MRF, CRF) (Besag, 1974, Lafferty et al., 2001), which combine local visual cues (the unary potentials) with interactions between nearby spatial units (the pairwise potentials) (Kluckner et al., 2009, Hoberg et al., 2015, Zhong and Wang, 2007, Shotton et al., 2006, Volpi and Ferrari, 2015). By maximizing the posterior joint probability of a CRF over the labeling (i.e. minimizing a Gibbs energy), one retrieves the most probable labeling of a given scene, i.e. the most probable configuration of local label assignments over the whole image space. These frameworks allow one to jointly model bottom-up evidence, encoded in the unary potentials, together with domain-specific prior information, encoded in the spatial interaction pairwise terms.
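For concreteness, the Gibbs energy of such a pairwise CRF can be written in its generic form (the specific potentials used in this paper are introduced later, in Section 3.3) as

$$E(\mathbf{y}\mid\mathbf{x}) \;=\; \sum_{i\in\mathcal{V}} \psi_i(y_i\mid\mathbf{x}) \;+\; \lambda \sum_{(i,j)\in\mathcal{E}} \psi_{ij}(y_i,y_j\mid\mathbf{x}),$$

where $\mathcal{V}$ is the set of spatial units (pixels or regions), $\mathcal{E}$ the set of neighboring pairs, $\psi_i$ the unary potential (typically the negative log-likelihood of assigning class $y_i$ given the observations $\mathbf{x}$), $\psi_{ij}$ the pairwise potential encoding the spatial prior, and $\lambda$ a weight balancing the two terms. The most probable labeling is $\mathbf{y}^{*}=\arg\min_{\mathbf{y}} E(\mathbf{y}\mid\mathbf{x})$, which is equivalent to maximizing the posterior $p(\mathbf{y}\mid\mathbf{x})\propto\exp\{-E(\mathbf{y}\mid\mathbf{x})\}$.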
The idea behind the proposed model is that, when dealing with urban imagery (and decimeter resolution imagery in general), both the content of the image and the classes are highly structured in the spatial domain, calling for data- and domain-specific regularization. Following this intuition, we model two key aspects of spatial dependencies: input and output space interactions. The former are usually encoded by operators accounting for the spatial autocorrelation of pixels in their spatial domain. The latter are encoded by different kinds of pairwise potentials, favoring specific configurations issued from a predefined prior distribution.
- To extract information about local input relations, we rely on state-of-the-art convolutional neural networks (CNN, LeCun et al., 1998, Simonyan and Zisserman, 2015, Krizhevsky et al., 2012) providing data-driven cues for multiple tasks: we employ a single CNN not only to provide approximate class-likelihoods, but also to predict semantic boundaries between the different classes. The latter usually coincide with natural edges in the image, but also correspond to changes in labeling. Then, we build a segmentation tree using the semantic boundaries predicted by the CNN. Such a tree represents a hierarchy of regions spanning from the lowest level, defined by groups of pixels (or superpixels), to the highest level, the whole scene. The region partitioning depends jointly on shallow-to-deep visual features and on the semantic boundaries learned by the multi-task CNN (an illustrative sketch of this multi-task prediction is given after this list).
- To account for the output relations between regions, we combine the information within each region of the hierarchy using a top-down graphical model that includes different key aspects of the spatial organization of labels, given the observed inputs. This second modeling step is based on a CRF that aims at reducing the complexity of the pixel-wise maps (i.e. regularizing them) by semantically and spatially parsing consistent regions of the image, likely to belong to given classes, at different scales. Specifically, the CRF model takes into account evidence from the CNN (class-likelihoods, learned visual features and presence of class-specific boundaries) and spatial interactions (label smoothness, label co-occurrence, region distances, elevation gradient) within the hierarchy. In other words, it learns the extent and the labeling of each segment simultaneously, by minimizing a specifically designed energy.
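Purely as an illustration, and not as the architecture described in Section 3.1 (which builds on a pretrained VGG network with hypercolumn features), the following minimal PyTorch sketch shows how a single network can output both per-pixel class scores and semantic-boundary probabilities and be trained with a weighted sum of the two task losses; all layer sizes, names and the weight `alpha` are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSegNet(nn.Module):
    """Toy multi-task CNN: shared encoder, one head for class scores,
    one head for semantic-boundary probabilities (illustrative only)."""

    def __init__(self, in_channels=4, n_classes=6):
        super().__init__()
        # Shared encoder (a stand-in for a pretrained VGG backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Task 1: per-pixel class scores (semantic segmentation).
        self.class_head = nn.Conv2d(64, n_classes, 1)
        # Task 2: per-pixel semantic-boundary score (binary).
        self.boundary_head = nn.Conv2d(64, 1, 1)

    def forward(self, x):
        feats = self.encoder(x)
        return self.class_head(feats), self.boundary_head(feats)

def multitask_loss(class_logits, boundary_logits, labels, boundaries, alpha=0.5):
    """Weighted sum of the two task losses (alpha is a free hyper-parameter)."""
    seg_loss = F.cross_entropy(class_logits, labels)
    bnd_loss = F.binary_cross_entropy_with_logits(boundary_logits, boundaries)
    return seg_loss + alpha * bnd_loss

# Example forward/backward pass on random data (4 channels, e.g. NIR-G-R + DSM).
if __name__ == "__main__":
    net = MultiTaskSegNet(in_channels=4, n_classes=6)
    x = torch.randn(2, 4, 128, 128)
    labels = torch.randint(0, 6, (2, 128, 128))
    boundaries = torch.rand(2, 1, 128, 128)
    class_logits, boundary_logits = net(x)
    loss = multitask_loss(class_logits, boundary_logits, labels, boundaries)
    loss.backward()
```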
A visual summary of the proposed pipeline is presented in Fig. 1.
We evaluate all the components of the system and show that spatial regularization is indeed useful in simplifying class structures spatially, while achieving accurate results. Since spatial structures are learned and encoded directly in the output map, we believe our pipeline is a step towards systems that are still based on machine learning but do not require extensive manual post-processing (e.g. local class filtering, spatial corrections, map generalization, fusion and vectorization; Crommelinck et al., 2016, Höhle, 2017), while at the same time employing domain knowledge and data-specific regularization, tailoring the model to the specific application domain and softening black-box effects. Specifically, the contributions of this paper are:
- A detailed explanation of our multi-task CNN, built on top of a pretrained network (VGG).
- A strategy to transform semantic boundary probabilities into superpixels and hierarchical regions (a simplified illustration is sketched after this list).
- A CRF encoding the desired space-scale relationships between segments.
- The combination of different energy terms accounting for multiple input-output relationships, merging bottom-up information (outputs and features of the CNN) and top-down information (multi-modal cues about spatial arrangement) into local and pairwise potentials.
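As a simplified illustration of the boundary-to-superpixel step (not the authors' actual segmentation-tree construction), low-boundary basins of a boundary-probability map can be used as seeds of a watershed transform. The snippet below assumes NumPy, SciPy and scikit-image, and the seed threshold is an arbitrary choice.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def boundary_prob_to_superpixels(boundary_prob, seed_threshold=0.1):
    """Illustrative conversion of a boundary-probability map in [0, 1] into
    superpixels: low-boundary basins become seeds, and a watershed on the
    boundary map grows them until they meet at strong boundaries."""
    seeds, n_seeds = ndi.label(boundary_prob < seed_threshold)
    superpixels = watershed(boundary_prob, markers=seeds)
    return superpixels, n_seeds

# Example on a synthetic boundary map with a single vertical boundary.
if __name__ == "__main__":
    prob = np.zeros((64, 64))
    prob[:, 31:33] = 0.9           # strong boundary down the middle
    sp, n = boundary_prob_to_superpixels(prob)
    print(n, np.unique(sp))        # two seed regions -> two superpixels
```

A hierarchy of regions, such as the segmentation tree described above, can then be obtained by iteratively merging adjacent superpixels separated by the weakest boundaries.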
In the next section, we summarize some relevant related works. In Section 3, we present the proposed system: the multi-task CNN architecture (Section 3.1), the hierarchical representation of image regions (Section 3.2) and the CRF model (Section 3.3). We present data and experimental setup in Section 4 and the results obtained in Section 5. We finally provide a discussion about our system in Section 6, leading to conclusions presented in Section 7.
Section snippets
Mid-level representations
To generate powerful visual models, traditional methods compute local appearance models mapping local descriptors to labels over a dense grid covering the image space. Then, the relationships between output variables are usually modeled by MRF and CRF. Standard approaches to local image descriptors involve the use of local color statistics, texture, bag-of-visual-words, local binary patterns, histograms of gradients and so on (Kluckner et al., 2009, Hoberg et al., 2015, Zhong and Wang, 2007,
Deep parsing of aerial images
Our model is composed of three main ingredients: a multi-task CNN providing class-likelihoods and probabilities of boundaries (Section 3.1), a segmentation tree (Section 3.2) and a CRF model encoding information about spatial dependency of the labeling (Section 3.3).
Vaihingen benchmark
The Vaihingen dataset is provided by the International Society for Photogrammetry and Remote Sensing (ISPRS), working group II/4, in the framework of the “2D semantic labeling contest” benchmark.
The dataset is composed of 33 orthorectified image tiles acquired by a near infrared (NIR) - green (G) - red (R) aerial camera, over the town of Vaihingen (Germany). Images are accompanied by a digital surface model (DSM)
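For illustration, a 4-channel input (NIR-G-R orthophoto plus DSM) for one tile could be assembled as sketched below; the file names are placeholders and the normalization choices are assumptions, not the preprocessing used in the paper.

```python
import numpy as np
from skimage import io

def load_tile(top_path, dsm_path):
    """Stack the NIR-G-R orthophoto and the DSM into one 4-channel array."""
    top = io.imread(top_path).astype(np.float32)    # (H, W, 3): NIR, G, R
    dsm = io.imread(dsm_path).astype(np.float32)    # (H, W): elevation
    top /= 255.0                                    # assumed 8-bit radiometry
    dsm = (dsm - dsm.mean()) / (dsm.std() + 1e-8)   # simple standardization
    return np.dstack([top, dsm])                    # (H, W, 4)

# Placeholder paths for one tile (adjust to the local dataset layout).
# x = load_tile("top/area1.tif", "dsm/area1.tif")
```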
Baselines
We compare the proposed segmentation pipeline to different baselines. The first, named Unary PX, evaluates the segmentation accuracy given by the pixelwise prediction from the CNN. The second, named Unary SP, reports the accuracy of the superpixel labeling. Pixel-based likelihoods are averaged for each superpixel and the maximum a posteriori for each region provides the final labels. Note that superpixels represent the lowest level of the segmentation tree and these are produced by
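The two baselines can be expressed compactly; the NumPy sketch below illustrates the described procedure (per-pixel argmax for Unary PX, per-superpixel averaging of the likelihoods followed by argmax for Unary SP) and is not the evaluation code used in the paper.

```python
import numpy as np

def unary_px(class_probs):
    """Unary PX: per-pixel maximum a posteriori over the CNN likelihoods.
    class_probs has shape (H, W, n_classes)."""
    return class_probs.argmax(axis=-1)

def unary_sp(class_probs, superpixels):
    """Unary SP: average the likelihoods inside each superpixel, then take the
    per-region maximum a posteriori. superpixels has shape (H, W) with
    non-negative integer region ids."""
    h, w, n_classes = class_probs.shape
    flat_sp = superpixels.ravel()
    n_regions = flat_sp.max() + 1
    # Sum the likelihoods of every pixel into its region, class by class.
    sums = np.zeros((n_regions, n_classes))
    np.add.at(sums, flat_sp, class_probs.reshape(-1, n_classes))
    counts = np.bincount(flat_sp, minlength=n_regions)[:, None]
    region_labels = (sums / np.maximum(counts, 1)).argmax(axis=-1)
    # Broadcast the per-region label back to the pixels.
    return region_labels[superpixels]

# Example with random likelihoods and a trivial two-region partition.
if __name__ == "__main__":
    probs = np.random.rand(4, 4, 3)
    probs /= probs.sum(axis=-1, keepdims=True)
    sp = np.zeros((4, 4), dtype=int)
    sp[:, 2:] = 1
    print(unary_px(probs))
    print(unary_sp(probs, sp))
```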
Discussion
The basic Unary PX and Unary SP perform very well in terms of AA. The reason is that they do not tend to oversmooth predictions and, even when Unary SP does, it oversmooths only locally within the extent of the superpixels. Since the shape of superpixels directly follows semantic boundaries in the images, average class-likelihoods in regions tend to be positively correlated with the actual label, filtering out spurious noise. Oversmoothing concerns mostly spatially small classes, where CRFs tend
Conclusions and future perspectives
We proposed a model fusing bottom-up hierarchical evidence about local appearance and top-down prior information about local spatial organization and pairwise relationships between superpixels.
Regarding the first aspect, we showcased the possibility of learning the tasks of dense semantic segmentation and semantic boundaries jointly, in an end-to-end way, using a modified pretrained network. To do so, we extended a CNN formulation for multi-task learning, relying on the hypercolumn
Acknowledgments
This work was partly supported by the Swiss National Science Foundation, grant 150593 “Multimodal machine learning for remote sensing information fusion” (http://p3.snf.ch/project-150593). The authors would also like to thank the Belgian Royal Military Academy, for acquiring and providing the Zeebrugges data used in this study, ONERA (The French Aerospace Lab), for providing the corresponding ground-truth data Lagrange et al. (2015), and the IEEE GRSS Image Analysis and Data Fusion Technical
References
- Audebert et al., 2018. Beyond RGB: very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogram. Rem. Sens.
- Jat et al., 2008. Monitoring and modelling of urban sprawl using remote sensing and GIS techniques. Int. J. Appl. Earth Observ. Geoinf.
- Marmanis et al., 2018. Classification with an edge: improving semantic image segmentation with boundary detection. ISPRS J. Photogram. Rem. Sens.
- Arbeláez et al., 2011. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell.
- Asner et al., 2005. Selective logging in the Brazilian Amazon. Science.
- Besag, 1974. Spatial interaction and the statistical analysis of lattice systems. J. R. Statist. Soc. Ser. B.
- Boykov et al., 2001. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell.
- Campbell et al., 1997. Interpreting image databases by region classification. Pattern Recog.
- Processing of extremely high resolution LiDAR and RGB data: outcome of the 2015 IEEE GRSS Data Fusion Contest. Part A: 2D contest, 2016. IEEE J. Sel. Topics Appl. Earth Observ. Rem. Sens.
- Crommelinck et al., 2016. Review of automatic feature extraction from high-resolution optical sensor data for UAV-based cadastral mapping. Rem. Sens.
- Felzenszwalb and Huttenlocher, 2004. Efficient graph-based image segmentation. Int. J. Comp. Vis.
- Gerke. Use of the Stair Vision Library within the ISPRS 2D Semantic Labeling Benchmark (Vaihingen).
- Giménez et al., 2017. Determination of grassland use intensity based on multi-temporal remote sensing data and ecological indicators. Rem. Sens. Environ.
- Integrating hierarchical segmentation maps with MRF prior for classification of hyperspectral images in a Bayesian framework. IEEE Trans. Geosci. Rem. Sens.
- Gould et al., 2008. Multi-class segmentation with relative location prior. Int. J. Comp. Vis.
- A new cascade model for the hierarchical joint classification of multitemporal and multiresolution remote sensing data. IEEE Trans. Geosci. Rem. Sens.
- Hoberg et al., 2015. Conditional random fields for multitemporal and multiscale classification of optical satellite imagery. IEEE Trans. Geosci. Rem. Sens.
- Höhle, 2017. Generating topographic map data from classification results. Rem. Sens.