Deep multi-task learning for a geographically-regularized semantic segmentation of aerial images

https://doi.org/10.1016/j.isprsjprs.2018.06.007

Abstract

When approaching the semantic segmentation of overhead imagery in the decimeter spatial resolution range, successful strategies usually combine powerful methods to learn the visual appearance of the semantic classes (e.g. convolutional neural networks) with strategies for spatial regularization (e.g. graphical models such as conditional random fields).

In this paper, we propose a method to learn evidence in the form of semantic class likelihoods, semantic boundaries across classes and shallow-to-deep visual features, each one modeled by a multi-task convolutional neural network architecture. We combine this bottom-up information with top-down spatial regularization encoded by a conditional random field model optimizing the label space across a hierarchy of segments with constraints related to structural, spatial and data-dependent pairwise relationships between regions.

Our results show that such a strategy provides better regularization than a series of strong baselines reflecting state-of-the-art technologies. The proposed strategy offers a flexible and principled framework to include several sources of visual and structural information, while allowing for different degrees of spatial regularization accounting for priors about the expected output structures.

Introduction

This paper deals with parsing decimeter resolution overhead images into semantic classes, relating to land cover and/or land use types. We will refer to this process as semantic segmentation. For a successful segmentation, one requires visual models able to disambiguate local appearance by understanding the spatial organization of semantic classes (Gould et al., 2008). To this end, machine learning models need to exploit different levels of spatial continuity in the image space (Campbell et al., 1997, Shotton et al., 2006). Accurate land cover and land use mapping is an active research field, growing in parallel to developments in sensors and acquisition systems and in data processing algorithms. Applications ranging from environmental monitoring (Asner et al., 2005, Giménez et al., 2017) to urban studies (Zhong and Wang, 2007, Jat et al., 2008) benefit from advances in the processing and interpretation of overhead data.

Semantic segmentation of sub-decimeter aerial imagery is often tackled by Markov and conditional random fields (MRFs, CRFs) (Besag, 1974, Lafferty et al., 2001) combining local visual cues (the unary potentials) with interactions between nearby spatial units (the pairwise potentials) (Kluckner et al., 2009, Hoberg et al., 2015, Zhong and Wang, 2007, Shotton et al., 2006, Volpi and Ferrari, 2015). By maximizing the posterior joint probability of a CRF over the labeling (i.e. minimizing a Gibbs energy), one retrieves the most probable labeling of a given scene, i.e. the most probable configuration of local label assignments over the whole image space. These frameworks allow the joint modeling of bottom-up evidence, encoded in the unary potentials, and domain-specific prior information, encoded in the pairwise spatial interaction terms.
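In their standard textbook form (the specific potentials used in this paper are defined over segments of a hierarchy, see Section 3.3), such models define a Gibbs distribution over labelings:

```latex
P(\mathbf{y}\mid\mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,\exp\!\bigl(-E(\mathbf{y},\mathbf{x})\bigr),
\qquad
E(\mathbf{y},\mathbf{x}) \;=\; \sum_{i\in\mathcal{V}} \psi_u\bigl(y_i\mid\mathbf{x}\bigr)
\;+\; \lambda \sum_{(i,j)\in\mathcal{E}} \psi_p\bigl(y_i,y_j\mid\mathbf{x}\bigr),
```

where $\mathcal{V}$ indexes the spatial units (pixels or segments), $\mathcal{E}$ their pairwise neighborhood relations, and $\lambda$ trades bottom-up evidence against spatial regularization; maximizing the posterior over labelings is then equivalent to minimizing the energy $E$.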

The idea behind the proposed model is that, when dealing with urban imagery (and decimeter resolution imagery in general), both the content of the image and the classes are highly structured in the spatial domain, calling for data- and domain-specific regularization. Following this intuition, we model two key aspects of spatial dependencies: input and output space interactions. The former are usually encoded by operators accounting for the spatial autocorrelation of pixels in the image domain. The latter are encoded by different kinds of pairwise potentials, favoring specific configurations drawn from a predefined prior distribution.

  • To extract information about local input relations, we employ state-of-the-art convolutional neural networks (CNNs; LeCun et al., 1998, Simonyan and Zisserman, 2015, Krizhevsky et al., 2012) providing data-driven cues for multiple tasks: a single CNN provides not only approximate class likelihoods, but also predictions of the semantic boundaries between classes. The latter usually coincide with natural edges in the image that also correspond to changes in labeling (a minimal sketch of such a multi-task architecture follows this list). We then build a segmentation tree using the semantic boundaries predicted by the CNN. Such a tree represents a hierarchy of regions spanning from the lowest level, defined by groups of pixels (superpixels), to the highest level, the whole scene. The region partitioning depends jointly on shallow-to-deep visual features and the semantic boundaries learned by the multi-task CNN.

  • To account for the output relations between regions, we combine the information within each region of a hierarchy using a top-down graphical model that encodes key aspects of the spatial organization of labels, given the observed inputs. This second modeling step is based on a CRF that aims at reducing the complexity of the pixel-wise maps (i.e. regularizing them) by parsing semantically and spatially consistent regions of the image, likely to belong to given classes, at different scales. Specifically, the CRF model takes into account evidence from the CNN (class likelihoods, learned visual features and the presence of class-specific boundaries) and spatial interactions (label smoothness, label co-occurrence, region distances, elevation gradients) within the hierarchy. In other words, it learns the extent and the labeling of each segment simultaneously, by minimizing a specifically designed energy.
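As a concrete illustration of the first point, the following PyTorch sketch shows a minimal two-head network of the kind described above: a shared convolutional trunk (here a VGG-16 feature extractor) with one head predicting class scores and one predicting semantic-boundary scores. This is a simplified, hypothetical stand-in for the paper's architecture, which additionally exploits hypercolumn features; the class and function names are ours.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class MultiTaskSegNet(nn.Module):
    """Shared VGG-16 trunk with two 1x1-convolution heads: per-pixel
    class scores and semantic-boundary scores, both upsampled back
    to the input resolution."""
    def __init__(self, n_classes):
        super().__init__()
        self.trunk = vgg16(weights=None).features   # convolutional layers only
        self.class_head = nn.Conv2d(512, n_classes, kernel_size=1)
        self.boundary_head = nn.Conv2d(512, 1, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.trunk(x)                        # (B, 512, h/32, w/32)
        cls = F.interpolate(self.class_head(feats), size=(h, w),
                            mode='bilinear', align_corners=False)
        bnd = F.interpolate(self.boundary_head(feats), size=(h, w),
                            mode='bilinear', align_corners=False)
        return cls, bnd

def multitask_loss(cls_logits, bnd_logits, labels, boundaries, w=1.0):
    # joint objective: cross-entropy for classes, weighted BCE for boundaries
    return (F.cross_entropy(cls_logits, labels)
            + w * F.binary_cross_entropy_with_logits(bnd_logits, boundaries))
```

In practice the trunk would be initialized with pretrained weights and the loss weight balanced on validation data.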

A visual summary of the proposed pipeline is presented in Fig. 1.

We evaluate all the components of the system and show that spatial regularization is indeed useful in simplifying class structures spatially, while achieving accurate results. Since spatial structures are learned and encoded directly in the output map, we believe our pipeline is a step towards systems that are still based on machine learning but do not require extensive manual post-processing (e.g. local class filtering, spatial corrections, map generalization, fusion and vectorization; Crommelinck et al., 2016, Höhle, 2017), while employing domain knowledge and data-specific regularization, tailoring the output to the application domain and softening black-box effects. Specifically, the contributions of this paper are:

  • A detailed explanation of our multi-task CNN, built on top of a pretrained network (VGG).

  • A strategy to transform semantic boundary probabilities into superpixels and hierarchical regions (see the sketch after this list).

  • A CRF encoding the desired space-scale relationships between segments.

  • The combination of different energy terms accounting for multiple input-output relationships, merging bottom-up information (outputs and features of the CNN) and top-down information (multi-modal cues about spatial arrangement) into local and pairwise relationships.
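Regarding the second contribution, the sketch below illustrates one standard way to derive superpixels from a boundary-probability map: flood a watershed from seeds placed in low-boundary areas, so that superpixel borders snap to the predicted semantic boundaries. The threshold, smoothing and function name are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def superpixels_from_boundary_prob(boundary_prob, seed_thresh=0.1, sigma=1.0):
    """boundary_prob: (H, W) array in [0, 1] predicted by the CNN.
    Returns an (H, W) integer map of superpixel ids."""
    surface = ndi.gaussian_filter(boundary_prob.astype(float), sigma=sigma)
    # seeds: connected components where the boundary probability is low
    seeds, _ = ndi.label(surface < seed_thresh)
    # flood the boundary-probability surface from the seeds; region
    # borders end up along ridges of high boundary probability
    return watershed(surface, markers=seeds)
```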

In the next section, we summarize some relevant related works. In Section 3, we present the proposed system: the multi-task CNN architecture (Section 3.1), the hierarchical representation of image regions (Section 3.2) and the CRF model (Section 3.3). We present data and experimental setup in Section 4 and the results obtained in Section 5. We finally provide a discussion about our system in Section 6, leading to conclusions presented in Section 7.

Section snippets

Mid-level representations

To generate powerful visual models, traditional methods compute local appearance models mapping local descriptors to labels over a dense grid covering the image space. The relationships between output variables are then usually modeled by MRFs and CRFs. Standard approaches to local image description involve local color statistics, texture, bags of visual words, local binary patterns, histograms of gradients and so on (Kluckner et al., 2009, Hoberg et al., 2015, Zhong and Wang, 2007,

Deep parsing of aerial images

Our model is composed of three main ingredients: a multi-task CNN providing class-likelihoods and probabilities of boundaries (Section 3.1), a segmentation tree (Section 3.2) and a CRF model encoding information about spatial dependency of the labeling (Section 3.3).
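Exact inference over such a model is intractable in general. As a rough illustration of what CRF inference over a region graph looks like, the sketch below runs Iterated Conditional Modes (ICM) with a simple Potts pairwise term; this is a deliberately simple stand-in, not the paper's energy (which adds boundary, co-occurrence, distance and elevation terms) nor its optimizer.

```python
import numpy as np

def icm_potts(unary, edges, lam=1.0, n_iter=10):
    """ICM on a region graph with a Potts pairwise term.
    unary: (n_regions, n_classes) negative log-likelihoods
    edges: list of (i, j) index pairs of adjacent regions"""
    n, c = unary.shape
    labels = unary.argmin(axis=1)          # start from the unary optimum
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(n_iter):
        changed = False
        for i in range(n):
            # local energy: unary + Potts penalty for disagreeing neighbors
            cost = unary[i].astype(float)
            for j in nbrs[i]:
                cost += lam * (np.arange(c) != labels[j])
            best = cost.argmin()
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```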

Vaihingen benchmark

The Vaihingen dataset is provided by the International Society for Photogrammetry and Remote Sensing (ISPRS), working group II/4, in the framework of the “2D semantic labeling contest” benchmark.

The dataset is composed of 33 orthorectified image tiles acquired by a near-infrared (NIR), green (G), red (R) aerial camera over the town of Vaihingen (Germany). Images are accompanied by a digital surface model (DSM)

Baselines

We compare the proposed segmentation pipeline to different baselines. The first, named Unary PX, evaluates the segmentation accuracy as given by the pixel-wise prediction of the CNN. The second, named Unary SP, reports the accuracy of the superpixel labeling: pixel-based likelihoods are averaged within each superpixel and the maximum a posteriori class of each region provides the final label (a minimal sketch follows). Note that superpixels represent the lowest level of the segmentation tree and these are produced by
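A minimal NumPy sketch of this Unary SP baseline, assuming a (H, W, C) likelihood volume and an (H, W) map of superpixel ids (shapes and names are our assumptions):

```python
import numpy as np

def unary_sp_labels(likelihoods, superpixels):
    """Average the per-pixel class likelihoods within each superpixel
    and assign each region its maximum a posteriori class."""
    H, W, C = likelihoods.shape
    flat_sp = superpixels.ravel()
    flat_lk = likelihoods.reshape(-1, C)
    n_sp = flat_sp.max() + 1
    sums = np.zeros((n_sp, C))
    np.add.at(sums, flat_sp, flat_lk)          # per-region accumulation
    counts = np.bincount(flat_sp, minlength=n_sp)[:, None]
    means = sums / np.maximum(counts, 1)
    return means.argmax(axis=1)[superpixels]   # broadcast back to (H, W)
```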

Discussion

The basic Unary PX and Unary SP baselines perform very well in terms of average accuracy (AA). The reason is that they do not tend to oversmooth predictions and, even when Unary SP does, it oversmooths only locally within the extent of the superpixels. Since the shape of the superpixels directly follows semantic boundaries in the images, average class likelihoods within regions tend to be positively correlated with the actual label, filtering out spurious noise. Oversmoothing concerns mostly spatially small classes, where CRFs tend

Conclusions and future perspectives

We proposed a model fusing bottom-up hierarchical evidence about local appearance and top-down prior information about local spatial organization and pairwise relationships between superpixels.

Regarding the first aspect, we showcased the possibility of learning the tasks of dense semantic segmentation and semantic boundary detection jointly, in an end-to-end way, using a modified pretrained network. To do so, we extended a CNN formulation for multi-task learning, relying on the hypercolumn

Acknowledgments

This work was partly supported by the Swiss National Science Foundation, grant 150593 “Multimodal machine learning for remote sensing information fusion” (http://p3.snf.ch/project-150593). The authors would also like to thank the Belgian Royal Military Academy for acquiring and providing the Zeebrugges data used in this study, ONERA (The French Aerospace Lab) for providing the corresponding ground-truth data (Lagrange et al., 2015), and the IEEE GRSS Image Analysis and Data Fusion Technical

References (47)

  • Dollar, P., Zitnick, C., 2013. Structured forests for fast edge detection. In: International Conference on Computer...
  • Felzenszwalb, P., et al., 2004. Efficient graph-based image segmentation. Int. J. Comp. Vis.
  • Gerke, M., 2015. Use of the Stair Vision Library within the ISPRS 2D Semantic Labeling Benchmark (Vaihingen).
  • Giménez, M.G., et al., 2017. Determination of grassland use intensity based on multi-temporal remote sensing data and ecological indicators. Rem. Sens. Environ.
  • Golipour, M., et al., 2016. Integrating hierarchical segmentation maps with MRF prior for classification of hyperspectral images in a Bayesian framework. IEEE Trans. Geosci. Rem. Sens.
  • Gould, S., et al., 2008. Multi-class segmentation with relative location prior. Int. J. Comp. Vis.
  • Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J., 2011. Semantic contours from inverse detectors. In:...
  • Hariharan, B., Arbeláez, P., Girshick, R., Malik, J., 2015. Hypercolumns for object segmentation and fine-grained...
  • Hedhli, I., et al., 2016. A new cascade model for the hierarchical joint classification of multitemporal and multiresolution remote sensing data. IEEE Trans. Geosci. Rem. Sens.
  • Hoberg, T., et al., 2015. Conditional random fields for multitemporal and multiscale classification of optical satellite imagery. IEEE Trans. Geosci. Rem. Sens.
  • Höhle, J., 2017. Generating topographic map data from classification results. Rem. Sens.
  • Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q., 2017. Densely connected convolutional networks. In: IEEE...
  • Kluckner, S., Mauthner, T., Roth, P.M., Bischof, H., 2009. Semantic classification in aerial imagery by integrating...