Dense Semantic Labeling with Atrous Spatial Pyramid Pooling and Decoder for High-Resolution Remote Sensing Imagery

Dense semantic labeling is significant in high-resolution remote sensing imagery research and has been widely used in land-use analysis and environmental protection. With the recent success of fully convolutional networks (FCN), various network architectures have largely improved performance. Among them, atrous spatial pyramid pooling (ASPP) and encoder-decoder are two successful ones. The former structure is able to extract multi-scale contextual information and multiple effective fields-of-view, while the latter structure can recover the spatial information to obtain sharper object boundaries. In this study, we propose a more efficient fully convolutional network by combining the advantages of both structures. Our model utilizes the deep residual network (ResNet) followed by ASPP as the encoder and combines two scales of high-level features with the corresponding low-level features as the decoder at the upsampling stage. We further develop a multi-scale loss function to enhance the learning procedure. In the postprocessing, a novel superpixel-based dense conditional random field is employed to refine the predictions. We evaluate the proposed method on the Potsdam and Vaihingen datasets and the experimental results demonstrate that our method performs better than other machine learning or deep learning methods. Compared with the state-of-the-art DeepLab_v3+, our model gains 0.4% and 0.6% improvements in overall accuracy on these two datasets, respectively.


Introduction
High-resolution remote sensing imagery captured by satellite or unmanned aerial vehicle (UAV) contains rich information and is significant in many applications, including land-use analysis, environmental protection and urban planning [1]. Due to the rapid development of remote sensing technology, especially the improvement of imaging sensors, a massive number of high-quality images are available to be utilized [2]. With the support of sufficient data, dense semantic labeling, also known as semantic segmentation in computer vision, is now an essential aspect of research and is playing an increasingly critical role in many applications [3].
To better understand the scene, dense semantic labeling aims at segmenting objects of given categories, such as buildings, trees and cars, from the background of the images at the pixel level [4]. In the past decades, a vast number of algorithms have been proposed. These algorithms can be broadly divided into traditional machine learning methods and deep learning methods. Among the latter, ASPP and encoder-decoder are two successful structures, and a model employing both of them could further improve the performance. Figure 1 shows details of alternative network structures for dense semantic labeling.
Inspired by the analysis above, we propose a novel fully convolutional network architecture that aims at simultaneously detecting objects of different shapes and restoring sharper object boundaries. Our model adopts the deep residual network (ResNet) as the backbone, followed by an atrous spatial pyramid pooling (ASPP) structure to extract multi-scale contextual information; these two parts constitute the encoder. We then design the decoder by fusing two scales of low-level feature maps from ResNet with the corresponding predictions to restore the spatial information. To make the fusion of these two structures effective, we append a multi-scale softmax cross-entropy loss function with corresponding weights at the end of the network. Different from the loss function in Reference [29], our proposed loss function guides every scale of prediction during the training procedure, which helps to better optimize the parameters in the intermediate layers. After the network, we improve the dense conditional random field (DenseCRF) with a superpixel algorithm in post-processing, which gives an additional boost to performance. Experiments on the Potsdam and Vaihingen datasets demonstrate that our model outperforms other state-of-the-art networks, achieving 88.4% and 87.0% overall accuracy respectively when the boundary pixels of objects are included in the calculation.
The main contributions of our study are listed as follows:
1. We propose a novel convolutional neural network that combines the advantages of ASPP and encoder-decoder structures.
2. We enhance the learning procedure by employing a multi-scale loss function.
3. We improve the dense conditional random field with a superpixel algorithm to optimize the prediction further.
The remainder of this paper is organized as follows: Section 2 describes our dense semantic labeling system which includes the proposed model and the superpixel-based DenseCRF. Section 3 presents the datasets, preprocessing methods, training protocol and results. Section 4 is the discussion of our method and Section 5 concludes the whole study.

Methods
In this paper, a dense semantic labeling system to extract categorized objects from high-resolution remote sensing imagery is proposed. The system involves the following stages. First, the imageries, including red, green, blue (RGB), infrared (IR) and normalized digital surface model (DSM) channels, and the groundtruth are sliced into small patches to generate the training and test data. Meanwhile, data augmentation methods such as random flipping and rescaling are employed to increase the diversity of the data. Then, our proposed fully convolutional network is trained on the training data; the training procedure is based on the gradient descent algorithm, which updates the parameters according to the gradients of the loss function to improve the performance of the network. After that, the trained model with the best parameters is chosen to generate predictions on the test data. Finally, we introduce a superpixel-based DenseCRF to optimize the predictions further. The pipeline of our dense semantic labeling system is illustrated in Figure 2.

Figure 2. The pipeline of our dense semantic labeling system, including data preprocessing, network training, testing and post-processing.

Encoder with ResNet and Atrous Spatial Pyramid Pooling
In this section, we introduce the encoder part of the proposed fully convolutional network. Our model adopts ResNet as the backbone, followed by the atrous spatial pyramid pooling structure. These two parts constitute the encoder to extract multiple scales of contextual information.

ResNet-101 as the Backbone
The backbone is the basic structure to extract features from input imageries in FCN-based models [30]. Nowadays, most works adopt classic classification networks, such as VGG and ResNet, without the fully connected parts. The reason is two-fold. First, these networks have excellent performance in the ImageNet Large Scale Visual Recognition Challenge. Second, we can fine-tune our network with the pre-trained model. In this study, we choose ResNet as the backbone of our model. ResNet alleviated the vanishing-gradient problem [31] by employing the bottleneck unit and achieved better accuracy and a smaller model size with deeper layers. Figure 3 shows details of ResNet and the bottleneck unit.
Figure 3. The structure of ResNet50/101, which consists of one convolution layer and four Blocks. Each Block has several Bottleneck units. Inside the bottleneck unit, there is a shortcut connection between the input and output. In this study, we choose ResNet101 as the backbone of our model.


Atrous Spatial Pyramid Pooling
In this study, we utilize ASPP after ResNet to further extract multi-scale contextual information. ASPP is a parallel structure of several branches that operate on the same feature map and whose outputs are fused in the end; it was first introduced in the DeepLab_v2 network. ASPP employs atrous convolution [32] in each branch. Different from standard convolution, atrous convolution has a rate parameter that inserts the corresponding number of zeros between the weights of the convolution filter. This operation is equivalent to a downsampling, convolution and upsampling process but performs much better, and it does not increase the number of parameters, which maintains the efficiency of the network. The ASPP structure has two versions. The original one in DeepLab_v2 includes four branches of atrous convolution with rates 6, 12, 18 and 24. However, when the filter's field of view approaches the size of the input feature maps (as with rate 24), only the center of the filter takes effect; in DeepLab_v3 [33], this branch was therefore replaced by a 1 × 1 convolution. Moreover, an image pooling branch is appended to incorporate global context information. In our model, we employ this advanced ASPP structure.
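The zero-insertion view of atrous convolution described above can be sketched in a few lines of NumPy. This is an illustrative toy (the function names are ours, not a library API): dilating a 3 × 3 filter with rate 2 enlarges its field of view to 5 × 5 while keeping only 9 non-zero parameters.

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between kernel weights (atrous filter)."""
    k = kernel.shape[0]
    size = rate * (k - 1) + 1  # effective field of view
    out = np.zeros((size, size), dtype=kernel.dtype)
    out[::rate, ::rate] = kernel
    return out

def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation, for demonstration only."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 filter with rate 2 covers a 5x5 field of view
# while still having only 9 non-zero parameters.
kernel = np.ones((3, 3))
atrous = dilate_kernel(kernel, rate=2)
print(atrous.shape)              # (5, 5)
print(np.count_nonzero(atrous))  # 9
```

In a real framework the dilation is applied implicitly (e.g., the `dilation_rate` argument of a convolution layer) rather than by materializing the zero-padded filter.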

Decoder and the Multi-scale Loss Function
In this section, we introduce the decoder part of our model. Based on the encoder structure mentioned above, we propose a two-step decoder structure at the upsampling stage to refine the boundary of objects in the final predictions and fuse the ASPP and encoder-decoder structure together. We also present a multi-scale loss function to solve the optimization problem caused by an excessive number of intermediate layers and to make the whole network more effective.




Proposed Decoder
The decoder structure restores the spatial information and improves the final prediction by fusing the multi-scale high-level features with the corresponding scales of low-level features at the upsampling stage. In our model, the resolution of the feature maps extracted by ResNet and ASPP is 16 times smaller than that of the input imageries. Here we propose a two-step decoder structure to restore the feature maps to the original resolution with the fusion of features from ResNet. First, the feature maps from the encoder are bilinearly upsampled by a factor of 2 and concatenated with the corresponding low-level features in ResNet that have the same resolution (Conv1 of Bottleneck4 in Block2). Meanwhile, to prevent the corresponding low-level features (512 channels) from outweighing the importance of the high-level encoder features (only 256 channels), we apply a 1 × 1 convolution to reduce the number of channels to 48. After the concatenation, another 1 × 1 convolution is applied to reduce the number of channels to 256. Then we repeat this process to concatenate the lower-level features in ResNet (Conv1 of Bottleneck3 in Block1). In the end, two 3 × 3 convolutions and one 1 × 1 convolution are applied to refine the features, followed by bilinear upsampling by a factor of 4. Figure 4 shows details of our decoder structure.
Figure 4. The architecture of our proposed fully convolutional network with the fusion of ASPP and encoder-decoder structures. ResNet101 followed by ASPP is the encoder part to extract multi-scale contextual information, while the proposed decoder, shown as the purple blocks, refines the boundaries of objects. In the end, the multi-scale loss function guides the training procedure.
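As a sanity check of the tensor shapes involved, the two-step decoder can be traced with a small sketch. The helper names are ours and the final two 3 × 3 refinement convolutions are folded into a single projection; the channel counts (256, 48, 6 classes) and scale factors follow the description above, assuming a 512 × 512 input and an output stride of 16.

```python
# Shape walkthrough of the two-step decoder (channels-last: H, W, C).
# The helpers only track shapes; they are not an implementation.

def upsample(shape, factor):
    h, w, c = shape
    return (h * factor, w * factor, c)

def concat(a, b):
    assert a[:2] == b[:2], "spatial sizes must match before concatenation"
    return (a[0], a[1], a[2] + b[2])

def conv(shape, out_channels):
    return (shape[0], shape[1], out_channels)

encoder = (32, 32, 256)   # 512/16: output of ResNet101 + ASPP
low2 = (64, 64, 48)       # Block2 low-level features after the 1x1 conv
low1 = (128, 128, 48)     # Block1 low-level features after the 1x1 conv

x = upsample(encoder, 2)        # (64, 64, 256)
x = conv(concat(x, low2), 256)  # fuse, then reduce back to 256 channels
x = upsample(x, 2)              # (128, 128, 256)
x = conv(concat(x, low1), 256)  # second fusion step
x = conv(x, 6)                  # refine + project to the 6 classes
x = upsample(x, 4)              # bilinear upsampling to input resolution
print(x)                        # (512, 512, 6)
```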


Multi-scale Loss Function
The loss function is a necessary component of a deep learning model: it measures the deviation between the prediction and the groundtruth and drives the optimization of the parameters through backpropagation. Traditional FCN-based models adopt a single softmax cross-entropy loss function at the end of the network. Our model adopts an encoder consisting of ResNet101 and ASPP and a decoder that fuses two scales of low-level features. There is a large number of intermediate layers and corresponding parameters, so a single loss function [34] is insufficient to optimize all the layers, especially those in the encoder part (far from the loss function). To solve this, we apply a multi-scale softmax cross-entropy loss function at different scales of prediction. First, we append a loss function at the end of the network, after the 4 times upsampling, with the groundtruth at the original resolution. Then, we apply a 3 × 3 convolution followed by a 1 × 1 convolution to the 2 times upsampled feature maps from ASPP and append another loss function with the 8 times downsampled groundtruth. The former loss function at the end guides the whole network training as in the traditional method, while the latter one in the middle further enhances the optimization of the parameters in the encoder. We also assign the corresponding weights λ1 and λ2 to these two losses. The overall loss function is

L_total = λ1 · L1 + λ2 · L2,

where L1 is the softmax cross-entropy loss at the original resolution and L2 is the loss at the 1/8 resolution. During the training phase, our model is trained by the stochastic gradient descent (SGD) algorithm [35] to minimize the overall loss. The best performance was achieved using λ1 = 0.5 and λ2 = 0.5. As more constraints were applied, the fusion of the ASPP and encoder-decoder structures could be more effective. The multi-scale loss function is shown in Figure 4 and the detailed configurations of the proposed network are shown in Table 1.
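The weighted two-term loss can be illustrated with a small NumPy sketch (not the TensorFlow implementation used in the paper): a mean pixel-wise softmax cross-entropy evaluated at two scales and combined with the weights λ1 and λ2.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean pixel-wise softmax cross-entropy.
    logits: (H, W, C) raw scores; labels: (H, W) integer class ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -picked.mean()

def multi_scale_loss(logits_full, labels_full, logits_small, labels_small,
                     lam1=0.5, lam2=0.5):
    """Overall loss L_total = lam1 * L1 + lam2 * L2: the full-resolution
    loss plus the 1/8-resolution auxiliary loss."""
    l1 = softmax_cross_entropy(logits_full, labels_full)
    l2 = softmax_cross_entropy(logits_small, labels_small)
    return lam1 * l1 + lam2 * l2

rng = np.random.default_rng(0)
full = rng.normal(size=(16, 16, 6))      # toy full-resolution logits
small = rng.normal(size=(2, 2, 6))       # toy 1/8-resolution logits
y_full = rng.integers(0, 6, size=(16, 16))
y_small = rng.integers(0, 6, size=(2, 2))
loss = multi_scale_loss(full, y_full, small, y_small)
print(loss > 0)   # True: cross-entropy is non-negative
```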

Dense Conditional Random Fields Based on Superpixel
For dense semantic labeling tasks, post-processing after the deep learning model is a common way to further optimize the predictions. The most widely used method is the dense conditional random field (DenseCRF) [36,37]. As a graph-based algorithm, it treats the pixel-level labels as random variables and the relationships between pixels as edges; together these constitute a conditional random field. The energy function employed in the CRF is

E(x) = Σ_i θ_i(x_i) + Σ_{i,j} φ_ij(x_i, x_j),

where x is the label assignment for the pixels of the input image, θ_i(x_i) is the unary potential that represents the probability at pixel i and φ_ij(x_i, x_j) is the pairwise potential that represents the cost between labels x_i, x_j at pixels i, j. The pairwise potential is

φ_ij(x_i, x_j) = μ(x_i, x_j) [ w_1 exp(−||p_i − p_j||² / 2σ_α² − ||I_i − I_j||² / 2σ_β²) + w_2 exp(−||p_i − p_j||² / 2σ_γ²) ],

where μ(x_i, x_j) = 1 if x_i ≠ x_j and μ(x_i, x_j) = 0 otherwise, as in the Potts model [38]. The remaining terms are two Gaussian kernels: the first depends on both pixel positions and colors, the second only on pixel positions. I_i and p_i are the color vector and the position of pixel i. The common inputs to the CRF are the prediction map and the RGB image.

In our study, we employ a superpixel algorithm (SLIC) [39,40] to boost the performance of the CRF. A superpixel algorithm segments the image into a set of patches, each consisting of pixels that are similar in color, location and so forth, and is therefore able to detect the boundaries of objects in images. The process of our superpixel-based CRF is as follows:

Algorithm 1. The process of CRF based on superpixel
1. Input the RGB image I and the prediction map L from our model. The J object categories in the dataset are denoted as C = {c_1, c_2, ..., c_J}.
2. Segment I into M superpixels {s_1, s_2, ..., s_M} using SLIC.
3. Loop: For i = 1 : M
   (1) Denote the N pixels in s_i as P_i = {p_i1, p_i2, ..., p_iN}.
   (2) Each pixel p_ij has a prediction l_ij in L, with l_ij ∈ C.
   (3) Assign the most frequent label among {l_i1, ..., l_iN} to all pixels in s_i.
4. Update the prediction map as L.
5. Apply DenseCRF to I and L, and output the final prediction L_final.
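Steps 3-4 of Algorithm 1 amount to a majority vote inside each superpixel. A minimal NumPy sketch, assuming the superpixel ids come from an off-the-shelf SLIC implementation such as `skimage.segmentation.slic` (the function name here is ours):

```python
import numpy as np

def superpixel_vote(pred, segments):
    """Replace each pixel's label with the majority label of its
    superpixel. `segments` assigns an integer superpixel id per pixel
    (e.g., the output of skimage.segmentation.slic)."""
    out = pred.copy()
    for sp in np.unique(segments):
        mask = segments == sp
        labels, counts = np.unique(pred[mask], return_counts=True)
        out[mask] = labels[np.argmax(counts)]
    return out

# Toy example: one noisy pixel inside an otherwise uniform superpixel
# gets corrected by the vote.
pred = np.array([[1, 1, 2],
                 [1, 2, 2],
                 [1, 1, 2]])
segments = np.array([[0, 0, 1],
                     [0, 0, 1],
                     [0, 0, 1]])
print(superpixel_vote(pred, segments))
# [[1 1 2]
#  [1 1 2]
#  [1 1 2]]
```

The voted map is then passed, together with the RGB image, to DenseCRF (e.g., via the PyDenseCRF package mentioned in Section 3) for the final refinement.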

Datasets
We evaluate our dense semantic labeling system on the ISPRS 2D high-resolution remote sensing imageries, which include the Potsdam and Vaihingen datasets. These two datasets are open online (http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html) and were captured by airborne sensors with 6 categories: impervious surfaces (white), building (blue), low vegetation (cyan), tree (green), car (yellow) and clutter/background (red). Their ground sampling distances are 5 cm and 9 cm. The Potsdam dataset contains 38 imageries with 5-channel data of red, green, blue, infrared and digital surface model (DSM) [41] at a resolution of 6000 × 6000; all the imageries have corresponding pixel-level groundtruth, with 24 imageries for training and 14 for testing. The Vaihingen dataset includes 33 imageries with 4-channel data of red, green, infrared and DSM at an approximate resolution of 2500 × 2500. Similar to the Potsdam dataset, all the imageries have corresponding pixel-level groundtruth; 16 imageries are for training and 17 are for testing. For the DSM in these two datasets, we utilize the normalized DSM in our evaluation. Figure 5 shows a sample of the imagery.


Preprocessing the Datasets
All the imageries in the datasets need to be preprocessed before being fed to our model; the preprocessing consists of two parts, slicing and data augmentation.
The resolution of the imageries is too high: due to the memory limit of the GPU hardware, feeding them directly to an FCN-based model is impossible. To deal with this problem, there are two common methods, namely slicing and downsampling [42]. Downsampling destroys the spatial structure of objects, especially small objects such as cars and low vegetation, so slicing is the better choice. In this study, according to the capacity of the GPU memory, we slice the training imageries into 512 × 512 patches with an overlap of 64 pixels (striding 448 pixels) and slice the test imageries with the same size without overlap.
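The slicing step can be sketched as follows (border handling is omitted for brevity; a real pipeline would pad the tile or add patches aligned to the right and bottom edges):

```python
import numpy as np

def slice_patches(image, patch=512, stride=448):
    """Slice a large tile into patch x patch pieces.
    A stride of 448 gives a 64-pixel overlap between neighbours."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            patches.append(image[top:top + patch, left:left + patch])
    return patches

tile = np.zeros((6000, 6000, 3), dtype=np.uint8)  # a Potsdam-sized tile
patches = slice_patches(tile)
print(len(patches))           # 169 patches (13 x 13 grid)
print(patches[0].shape)       # (512, 512, 3)
```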
Deep learning is a data-driven method; acquiring accurate results relies on the diversity and quality of the datasets. Data augmentation is an effective way to improve performance with the same amount of data [43]. In this study, we employ several specific methods. The problem of color imbalance, usually caused by the change of seasons and the incidence angle of sunlight, has a significant influence in remote sensing imagery research. To address it, we randomly change the brightness, saturation, hue and contrast to augment the datasets. Object rotation is another problem to deal with. Unlike general images, remote sensing imageries are captured in the air at different shooting angles; to handle this, we randomly flip the imageries in the horizontal and vertical directions. For the problem of objects at multiple scales, we rescale the imageries by a factor of 0.5 to 2 and apply padding or cropping to restore the original resolution.
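A minimal sketch of the geometric augmentations (random flips and rescaling). The nearest-neighbour rescaling here is a simplification of proper interpolation, the color jitter is omitted, and the function names are illustrative; the key point is that the label map must undergo exactly the same transform as the image.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_flip(image, label):
    """Random horizontal/vertical flips applied jointly to image and label."""
    if rng.random() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:
        image, label = image[::-1, :], label[::-1, :]
    return image, label

def random_rescale(image, label, low=0.5, high=2.0):
    """Random rescale by nearest-neighbour sampling; padding or cropping
    back to the original size would follow elsewhere in the pipeline."""
    factor = rng.uniform(low, high)
    h, w = label.shape
    rows = (np.arange(int(h * factor)) / factor).astype(int).clip(0, h - 1)
    cols = (np.arange(int(w * factor)) / factor).astype(int).clip(0, w - 1)
    return image[np.ix_(rows, cols)], label[np.ix_(rows, cols)]
```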

Training Protocol and Metrics
Our proposed model is deployed on the TensorFlow deep learning platform [44] with one NVIDIA GTX1080Ti GPU (11 GB RAM). Because of the memory limit, the batch size of input imageries is set to 6. For the learning rate, we have explored different policies, including the fixed policy and the step policy. The results show that the 'poly' learning rate policy is the best one. The formula is:

learning_rate = initial_learning_rate × (1 − iteration / max_iteration)^power, (4)

where initial_learning_rate = 0.007, power = 0.9 and max_iteration = 100,000 in this study. The training time of the proposed network is 21 hours. The optimizer that we employed is stochastic gradient descent (SGD) with a momentum of 0.9. Our post-processing method of superpixel-based DenseCRF was implemented based on MATLAB and the open source PyDenseCRF package.
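Equation (4) with the stated hyper-parameters can be transcribed directly; the learning rate decays smoothly from its initial value to zero over the course of training:

```python
def poly_learning_rate(iteration, initial_learning_rate=0.007,
                       power=0.9, max_iteration=100000):
    """'Poly' learning rate policy of Equation (4)."""
    return initial_learning_rate * (1 - iteration / max_iteration) ** power

print(poly_learning_rate(0))        # 0.007 at the start of training
print(poly_learning_rate(100000))   # 0.0 at the final iteration
```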
The metrics used to evaluate our dense semantic labeling system involve 4 different criteria: overall accuracy (OA), F1 score, precision and recall, all of which are frequently employed in previous works. The formulas are as follows:

OA = (TP + TN) / (P + N)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 × precision × recall / (precision + recall)

where P is the number of positive samples, N is the number of negative samples, TP is the true positive, TN is the true negative, FP is the false positive and FN is the false negative.
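The four metrics can be computed from the confusion counts, for example:

```python
def metrics(tp, fp, fn, tn):
    """Overall accuracy, precision, recall and F1 from confusion counts."""
    p = tp + fn          # number of positive samples
    n = tn + fp          # number of negative samples
    oa = (tp + tn) / (p + n)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return oa, precision, recall, f1

oa, pr, rc, f1 = metrics(tp=80, fp=10, fn=20, tn=90)
print(round(oa, 3), round(pr, 3), round(rc, 3), round(f1, 3))
# 0.85 0.889 0.8 0.842
```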

Experimental Results
To better evaluate our dense semantic labeling system, U-net, DeepLab_v3 and even the newest version, DeepLab_v3+ [29], are adopted as baselines for comparison with our proposed model. Moreover, the classic DenseCRF is employed as a contrast to our superpixel-based DenseCRF. It should be noted that all of the metric scores are computed including the pixels of the object boundaries.
Our proposed model achieves 88.3% overall accuracy on the Potsdam dataset and 86.7% overall accuracy on the Vaihingen dataset. Figure 6 shows a sample result of our proposed model on the Potsdam dataset and Figure 7 shows a sample result on the Vaihingen dataset. The first column is the input high-resolution remote sensing imageries; the second column is their corresponding groundtruth; and the last column represents the prediction maps of our model. The detailed results on these two datasets are shown in Tables 2 and 3.


The Importance of Multi-scale Loss Function
Encoder-decoder and ASPP are two powerful network structures, as demonstrated by former works. The objective of our proposed model is to fuse them to achieve better labeling performance. Experiments show that simply assembling the proposed encoder and decoder with the traditional single loss function at the end of the network cannot obtain any improvement over the DeepLab_v3+ model. Due to the complexity of the network after fusion, the number of parameters increases significantly, so additional guidance is needed to make the gradient optimization smoother. Different from the single loss function, the multi-scale loss function can better guide the network during the training procedure. As shown in Table 4, the overall accuracy on the Potsdam and Vaihingen datasets is improved by 0.33% and 0.82% respectively. Meanwhile, the precision, recall and F1 score are also improved. The improvement indicates that the proposed decoder structure with two-scale feature fusion takes effect with the multi-scale loss function. Both the decoder structure and the multi-scale loss function are essential to our model.

Comparison to DeepLab_v3+ and Other State-of-the-art Networks
DeepLab is a series of models that consists of the v1 [45], v2, v3 and v3+ versions. Each of them achieved the best performance at its time on several datasets in the computer vision field, such as PASCAL VOC 2012 [46] and Cityscapes [47]; DeepLab can be regarded as one of the most successful models for dense semantic labeling tasks, also called semantic segmentation tasks. Among them, DeepLab_v3+ is the newest version, published in early 2018. On the basis of the improved ASPP structure, the DeepLab_v3+ model employs a simple encoder-decoder structure which fuses only one scale of low-level feature maps after ASPP. Different from it, our proposed model adopts a more complex encoder-decoder structure with the fusion of two scales of low-level feature maps and an additional multi-scale loss function to enhance the learning procedure. Results on the Potsdam and Vaihingen datasets (Table 5) demonstrate that our model slightly improves the performance on remote sensing imageries. We further evaluate our model in comparison to other classic or state-of-the-art networks, including FCN, DeepLab_v3, U-net and some methods on the leaderboard of the ISPRS 2D datasets. SVL_1 is a traditional machine learning method based on an AdaBoost classifier and CRF. Though deep learning methods show an absolute advantage over it, it can still serve as a baseline method. DST_5 [48] employs a non-downsampling CNN that performs better than the original FCN. RIT6 [49] is a recently published approach which uses two specific ways to extract features and fuses the feature maps at different stages. Table 6 shows the quantitative results of the methods mentioned above. As we can see, our proposed model has fewer misclassified areas as well as sharper object boundaries. The prediction results are shown in Figure 8.

The Influence of Superpixel-based DenseCRF
Dense Conditional Random Field (DenseCRF) is an effective postprocessing method for further refining object boundaries after FCN-based models. However, as networks have improved, its enhancement effect has become weaker. In this study, we first apply classic DenseCRF after our model and find that the prediction accuracy drops slightly. To improve performance, inspired by the work of Zhao [50], we apply the superpixel algorithm SLIC before DenseCRF (details in Section 2.3). The superpixel-based DenseCRF brings 0.1% and 0.3% improvements in overall accuracy on the Potsdam and Vaihingen datasets respectively; Figure 9 and Table 7 show details. From the images, we can see that the superpixel-based DenseCRF removes some small errors and slightly improves object boundaries.
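One common way to exploit superpixels in such a pipeline is to enforce label consistency within each superpixel before (or after) the CRF pass. The sketch below shows only this consistency step as a majority vote over a precomputed superpixel map; it is a simplified illustration, not the exact formulation of Section 2.3, and in practice the superpixel map would come from an SLIC implementation such as `skimage.segmentation.slic`.

```python
import numpy as np

def superpixel_majority_vote(pred, segments):
    """Replace each pixel's class with the majority class of its superpixel.
    pred:     (H, W) integer class map produced by the FCN.
    segments: (H, W) integer superpixel ids (e.g. from SLIC)."""
    refined = pred.copy()
    for sp in np.unique(segments):
        mask = segments == sp
        classes, counts = np.unique(pred[mask], return_counts=True)
        refined[mask] = classes[counts.argmax()]   # assign the dominant class
    return refined
```

Because SLIC superpixels adhere to image edges, this step removes small isolated misclassifications inside homogeneous regions while leaving object boundaries to the subsequent DenseCRF refinement.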

Conclusions
In this paper, a novel fully convolutional network for dense semantic labeling of high-resolution remote sensing images is proposed. The main contribution of this work is to analyze the advantages of existing FCN-based models, identify the encoder-decoder and ASPP as two powerful structures, and fuse them in one model with an additional multi-scale loss function that makes the fusion effective. Moreover, we employ several data augmentation methods before training and a superpixel-based CRF as the postprocessing method. The objective of our work is to further improve the performance of fully convolutional networks on dense semantic labeling tasks. Experiments were conducted on the ISPRS 2D challenge, which includes the two high-resolution remote sensing imagery datasets of Potsdam and Vaihingen. Every object of the given categories was extracted successfully by our proposed method, with fewer classification errors and sharper boundaries. We compared our model with U-net, DeepLab_v3, DeepLab_v3+ and several methods from the leaderboard, including a recently published one. The results indicate that our method outperformed the others and achieved a significant improvement.
Nowadays, remote sensing technology is developing rapidly, especially with the popularization of unmanned aerial vehicles and high-resolution sensors, and more and more remote sensing images are available. Meanwhile, deep learning based methods have achieved results acceptable for practical applications. However, the ground truth of remote sensing images must be annotated manually, which requires considerable labor. Therefore, semi-supervised or weakly supervised methods should be considered in future work.