HRCNet: High-Resolution Context Extraction Network for Semantic Segmentation of Remote Sensing Images

Abstract: Semantic segmentation is a significant method in remote sensing image (RSI) processing and has been widely used in various applications. Conventional convolutional neural network (CNN)-based semantic segmentation methods are likely to lose spatial information in the feature extraction stage and usually pay little attention to global context information. Moreover, the imbalance of category scales and uncertain boundary information coexist in RSIs, which poses a further challenge to the semantic segmentation task. To overcome these problems, a high-resolution context extraction network (HRCNet) based on the high-resolution network (HRNet) is proposed in this paper. In this approach, the HRNet structure is adopted to keep the spatial information. Moreover, a light-weight dual attention (LDA) module is designed to obtain global context information in the feature extraction stage, and a feature enhancement feature pyramid (FEFP) structure is proposed and employed to fuse the contextual information of different scales. In addition, to exploit boundary information, we design a boundary aware (BA) module combined with a boundary aware loss (BAloss) function. The experimental results evaluated on the Potsdam and Vaihingen datasets show that the proposed approach significantly improves boundary and segmentation performance, reaching overall accuracy scores of 92.0% and 92.3%, respectively. Consequently, it is envisaged that the proposed HRCNet model will be advantageous in remote sensing image segmentation.


Introduction
Remote sensing images (RSIs) collected from satellite and aerial platforms are widely used in various applications, such as land-use mapping, urban resources management, and disaster monitoring [1]. Deep-learning-based segmentation methods, comprising instance segmentation and semantic segmentation, are crucial for the automatic analysis and exploitation of RSIs in the aforementioned applications. Instance segmentation methods originated from the region-based CNN (RCNN [2][3][4][5]). These methods focus on region classification with less consideration of the background, which makes them especially suitable for region-highlighting tasks such as ice-wedge detection [6][7][8][9]. However, for urban scenes every component needs to be considered, so semantic segmentation is a better choice. Due to the rapid development of remote sensing technology, especially the improvement of customized imaging sensors, a massive number of high-quality images are available to be analysed [10]. The ever-increasing volume of RSIs enables numerous semantic segmentation methods to be applied in satellite remote sensing image analysis.
The remote sensing image segmentation performance is determined by three vital factors: spatial information, global context information and boundary details. First, spatial information of RSIs is beneficial in restoring damaged structure and reducing the effects […]
• Based on the parallel-branch architecture of HRNet, we promote the feature enhancement feature pyramid (FEFP) module to integrate the low-to-high features, improving the segmentation accuracy of each category at different scales.
• We propose the multi-level loss functions comprising CEloss, BAloss, and SEloss to optimize the learning procedure.

Related Work
In this section, related work regarding state-of-the-art remote sensing applications and model design is reviewed. The model design part comprises the design of the backbone, boundary problems, and attention mechanisms, where different approaches are compared along with their advantages and shortcomings.

Remote Sensing Applications
With the development of remote sensing technology, remote sensing applications such as road detection, urban resources management and land-use mapping have spread to all fields of society. The road detection technology introduced in [23] adopted the GF-3 satellite (the first C-band multi-polarized synthetic aperture radar satellite in China) to analyse road conditions, which is helpful to improve the level of urban management. Hyperspectral remote sensing images are known for rich but redundant spectral information; Ref. [24] presented a hybrid lossless compression technique to reduce the redundant information, which could help to use these RSIs efficiently. The newly developed semantic segmentation technology [25] based on deep learning adopted a more efficient technique for land-use mapping. Such semantic segmentation technology combined with deep learning inspired us to design a more powerful model for achieving better performance in remote sensing segmentation.

Model Design
Combining the aforementioned three vital factors: spatial information, boundary details and global context information, three parts of the model design will be introduced respectively.

Design of the Backbone
Semantic segmentation is an end-to-end image segmentation method that makes the network directly predict the pixel-level label map. The backbone, as the feature extractor of the network, plays a paramount role in the segmentation task, because its performance directly affects the "encoder" and the segmentation head [26]. LeNet [27] is only capable of dealing with easy tasks due to its small number of convolution layers (only five). According to [28], increasing the depth (number of convolution layers) can enlarge the receptive field and strengthen the discriminating ability of each pixel more effectively than increasing the width (dimensions of each convolution layer).
However, increasing the depth of the network leads to the vanishing gradient problem, making CNNs difficult to converge. To overcome this problem, He et al. [29] proposed the residual connection structure in the Deep Residual Network (ResNet) model, allowing CNNs to go deeper, even beyond 1000 layers, without vanishing gradients. Based on that, variants such as Densely Connected Convolutional Networks (DenseNet [30]) were designed to enhance the connections between residual units and slightly improve on ResNet. In comparison with ResNet, HRNet is a parallel-branch model that can appropriately cope with multi-spectral/multi-scale images; however, it does not consider the effects of global context information or attention mechanisms.

Boundary Problems
Boundary problems caused by the network always lead to degraded classification performance: the network cannot treat categories (e.g., road, vegetation and buildings) of different scales equally, so some small targets may be neglected [15]. In contrast, large objects are more reliably identified because the sensor provides clearer spatial information and details for them. Small objects with less spatial information are unreliable, since images taken by different sensors may be distorted and the manual labels may contain errors.
To solve these boundary problems, the most commonly used method is skip connections, where shallow layers are fused into the CNNs owing to their rich contour information [12,31]. Mou et al. [32] proposed a method that combined an FCN [33] and a recurrent neural network (RNN [34]) to achieve accurate object boundary inference and semantic segmentation [35]. U-Net [10,12] adopts a number of skip connections between top layers and bottom layers at the upsampling stage to restore high-resolution information. The aforementioned two methods focus on boundary information acquisition. Although this approach is promising, there is no quantitative verification showing that the boundary improvement is actually caused by the shallow layers. Meanwhile, the fusion is not smooth, since low-level spatial features (from shallow layers) may damage high-level semantic features. Inspired by [9], which gives an insightful discussion of fusion algorithms, we design an explainable boundary aware module to smoothly integrate high-level and low-level features.

Attention Mechanisms
The attention mechanism is significant in the phase of obtaining global context information. Attention mechanisms can be divided into two types: spatial attention and channel attention. The spatial attention mechanism is designed to capture long-range dependencies of each pixel [14]. The channel attention mechanism can obtain the relationships between different categories [36]. To make CNNs context-aware yet lightweight, Zhong et al. [36] proposed a channel attention mechanism to obtain the relevance between different categories. Both spatial and channel attention mechanisms yield promising results for classification problems in [17,37], but these two methods are usually added in the segmentation head (where spatial information is already damaged) and are extremely heavy-weight. Therefore, a Light-weight Dual Attention (LDA) module applicable at early stages is highly desirable.

Methods
In this section, we first present the basic HRNet, and then illustrate the framework of our proposed HRCNet, including the improvements based on HRNet. Finally, each component of HRCNet is described in detail.

The Basic HRNet
As depicted in Figure 2, HRNet starts from a high-resolution subnetwork (Branch1), gradually adding high-to-low resolution subnetworks one by one to form more branches, and connecting the multi-resolution subnetworks in parallel [13]. It maintains high-resolution features, providing n stages with corresponding n branches and n resolutions (here n is set to 4) [38]. After the input, two stride-2 3 × 3 convolution layers (see the Stem section in Figure 2) decrease the resolution to 1/4 and increase the width (number of channels of the convolution layers) to 64. The channel numbers of the different branches are set to C, 2C, 4C and 8C, respectively, where C can be 32 or 48 (denoted HRNet_W32 and HRNet_W48, with W meaning width). Meanwhile, the resolution decreases as H/4 × W/4, H/8 × W/8, H/16 × W/16 and H/32 × W/32. When applied to semantic segmentation, the final four output features are mixed up to generate multi-scale semantic information [39]. Partial feature maps of stage1 are visualized (see the heat map), where the red region represents the focus area. It can be seen that the shallow features focus more on structural information. The multi-branch parallel structure of HRNet can efficiently obtain spatial information, but it does not consider global context information or boundary information.
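The branch widths and resolutions above follow directly from the base width C. As a minimal sketch (the helper name is hypothetical, and a 512 × 512 input is assumed for illustration):

```python
def hrnet_branch_shapes(c, input_hw=(512, 512), n_branches=4):
    """Channel widths and feature resolutions of HRNet's parallel branches.

    c is the base width (32 for HRNet_W32, 48 for HRNet_W48); branch k
    has 2**k * c channels at 1/(4 * 2**k) of the input resolution.
    """
    h, w = input_hw
    shapes = []
    for k in range(n_branches):
        channels = c * 2 ** k
        shapes.append((channels, h // (4 * 2 ** k), w // (4 * 2 ** k)))
    return shapes

# HRNet_W32 on a 512 x 512 input:
# [(32, 128, 128), (64, 64, 64), (128, 32, 32), (256, 16, 16)]
print(hrnet_branch_shapes(32))
```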

Framework of the Proposed HRCNet
To make up for the shortcomings of HRNet, HRCNet adopts the following designs. As shown in Figure 3, the proposed framework consists of three sections: the backbone, the segmentation head, and the loss functions. First, the backbone is introduced as a feature extractor to obtain semantic information, downsampling (Stem) the input to expand the receptive fields and obtain contextual information (Light HRNet). Second, the segmentation head is applied to rebuild high-resolution features, employing a low-to-high structure to fuse the four branches after feature enhancement (FEFP). Finally, the multi-level loss function is proposed to supervise the classification of boundaries, pixels, and categories by integrating three loss functions (BAloss, CEloss and SEloss).

Light-Weight High-Resolution Network (Light HRNet)
In practical scenarios, it is essential to guarantee the accuracy of the model while improving real-time performance, so the computational efficiency should be high. Inspired by light-weight semantic segmentation networks such as BiseNet [40,41], ICNet [42], ShuffleNet [43,44], and MobileNet [45][46][47], it can be found that ResNet-based networks (e.g., ResNet18, ResNet50, or some variants) are the commonly used backbones because of their high efficiency. In addition, the parallel-branch architecture has been proved efficient [48,49]. As shown in Figure 4, the light-weight HRNet architecture adopts the parallel-branch ResNet as the backbone; the number of modules in each stage is reduced to one, and the minimum number of residual units is kept in each branch of the same stage. Compared with the aforementioned networks, great performance improvements can be achieved using the proposed Light HRNet with relatively high computational efficiency.

Light-Weight Dual Attention (LDA) Module
As shown in Figure 5, the LDA module is composed of the LSA module and the LCA module, which obtain spatial relevance and channel relevance, respectively; both attention modules are light-weight. The middle branch in the LDA module is a residual unit, which is used to obtain the feature representation. The LCA module is applied behind the residual unit because channel relevance is high-level semantic information. The LSA module is parallel with the residual unit, using the same input 'X', since high-resolution features are suited for obtaining spatial information. To verify the design, the positions of the LCA and LSA modules were exchanged, and the results indicate that the proposed LDA arrangement yields the best performance. Meanwhile, the number of LDA modules was also considered to balance the computation load and the final performance. The results show that one LDA module is sufficient to extract the contextual information, so stacking more LDA modules is unnecessary.

Light-Weight Spatial Attention (LSA) Module
The LSA module builds on the Global Context Network (GCNet [50]), which shows that with conventional Non-Local Networks [14], each pixel in an H×W feature map learns almost the same attention map; that is, the global context is location-independent. Therefore, to simplify the calculation, one attention map is enough to represent the relationship between each pixel and the whole H×W pixels. Given an input feature map X ∈ R^(C×H×W), the calculation details are summarized as follows:
1. The first branch applies a 1×1 convolution to X to generate a feature map of size R^(1×H×W), reshapes it to R^(HW×1×1), and applies a softmax function. The second branch reshapes X to R^(C×HW). The two branches' results are then multiplied to obtain the feature X_1 ∈ R^(C×1×1). F(·) denotes the convolution operation, α(·) the softmax function, f_r(·) the reshape, and ⊗ in red matrix multiplication.
2. To reduce the number of parameters, a 1×1 convolution turns feature X_1 into size R^(C/r×1×1), where r is the bottleneck ratio, usually set to 16. Then, batch normalization (BN [51]) and an activation function (ReLU [52]) are applied to improve the generalization ability of the network. After that, the feature is restored to size R^(C×1×1) and added to X, giving the final output Y_1 ∈ R^(C×H×W). ⊕ in red denotes the channel-wise summation operation, and f_bn&relu(·) denotes BN followed by ReLU.
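The two steps can be sketched in NumPy for a single feature map. This is only a shape-and-dataflow illustration: the 1×1 convolutions are stand-in random weight matrices and BN is omitted, so it is not the trained module.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def lsa_block(x, r=16, rng=np.random.default_rng(0)):
    """Sketch of the LSA (GCNet-style) module on one feature map x: (C, H, W)."""
    c, h, w = x.shape
    # Branch 1: 1x1 conv -> (1, H, W), reshape to (HW,), softmax over positions.
    w_mask = rng.standard_normal(c)
    attn = softmax(np.einsum('c,chw->hw', w_mask, x).reshape(h * w))
    # Branch 2: reshape X to (C, HW); matmul gives the global context X_1 (C, 1, 1).
    x1 = (x.reshape(c, h * w) @ attn).reshape(c, 1, 1)
    # Bottleneck: C -> C/r -> C with ReLU in between (BN omitted for brevity).
    w_down = rng.standard_normal((c // r, c))
    w_up = rng.standard_normal((c, c // r))
    z = np.maximum(w_down @ x1.reshape(c), 0.0)
    # Channel-wise summation: the (C, 1, 1) context broadcasts over H x W.
    return x + (w_up @ z).reshape(c, 1, 1)

y = lsa_block(np.ones((32, 8, 8)))
print(y.shape)  # (32, 8, 8)
```

Because the attention map is shared by all pixels, the cost is one H×W softmax and one matrix–vector product instead of the HW × HW affinity matrix of a full non-local block.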

Light-Weight Channel Attention (LCA) Module
Similar to the LSA module, the LCA module pays attention to the relevance between channels (C). Given an input feature map X_2 ∈ R^(C×H×W), global average pooling is applied to each H×W channel of the feature map to generate a global representation of the feature. Then, two fully connected layers are added with the bottleneck ratio r = 16 to reduce the parameters. Finally, a sigmoid function is applied and the result is multiplied with X_2. Here, F_gap(·) denotes global average pooling, F_fc(·) the fully connected layer, ⊗ channel-wise multiplication, β(·) the sigmoid function, and Y_2 the output of the LCA module.
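The computation can be sketched as follows (a squeeze-and-excitation-style NumPy sketch; the two fully connected layers use stand-in random weights, so only the shapes and gating behaviour are meaningful):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lca_block(x2, r=16, rng=np.random.default_rng(0)):
    """Sketch of the LCA module (channel attention) on x2: (C, H, W)."""
    c = x2.shape[0]
    # Global average pooling over each H x W map -> one scalar per channel.
    squeeze = x2.mean(axis=(1, 2))
    # Two FC layers with bottleneck ratio r, then sigmoid gating in (0, 1).
    w1 = rng.standard_normal((c // r, c))
    w2 = rng.standard_normal((c, c // r))
    gate = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))
    # Channel-wise multiplication re-weights every feature map.
    return x2 * gate[:, None, None]

y2 = lca_block(np.ones((32, 8, 8)))
print(y2.shape)  # (32, 8, 8)
```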

Feature Enhancement Feature Pyramid (FEFP) Module
The FEFP module is proposed to replace the original fusing operation in HRNet in order to utilize the multi-scale contextual information of the parallel architecture. Feature Pyramid Networks (FPN) can generate features at four scales by employing downsampling operations and fuse the features step by step [53]. However, the resulting multi-scale information is highly related to the original features but carries limited semantic information [54]. To overcome these defects of FPN, some improvements are made. First, the four branches' outputs of HRCNet are utilized to replace the four scales' features, without using the spatial-reduction downsampling operation. Second, the advantages of DenseNet (densely connected operations to strengthen information exchange among features) and the Atrous Spatial Pyramid Pooling (ASPP: dilated convolutions with different rates to obtain multi-scale information) module are combined [55]. Therefore, in our work, the FEFP module shown in Figure 6 is presented, incorporating dense connections, ASPP and FPN to make the best use of high-level semantic information.
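The dilated-convolution branches of ASPP can be illustrated with a minimal single-channel NumPy sketch (kernel weights are placeholders; the point is that a rate-d 3×3 kernel samples a (2d+1)-pixel-wide neighbourhood, enlarging the receptive field without extra parameters):

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """Single-channel 3x3 dilated convolution ('same' padding, stride 1)."""
    h, w = x.shape
    xp = np.pad(x, rate)  # pad by `rate` on every side to keep the size
    out = np.zeros_like(x, dtype=float)
    for ki in range(3):
        for kj in range(3):
            di, dj = ki * rate, kj * rate
            out += kernel[ki, kj] * xp[di:di + h, dj:dj + w]
    return out

x = np.ones((8, 8))
k = np.ones((3, 3))
# ASPP-style parallel branches with different rates, stacked channel-wise.
branches = [dilated_conv2d(x, k, rate) for rate in (1, 2, 4)]
fused = np.stack(branches)
print(fused.shape)  # (3, 8, 8)
```

In FEFP these multi-rate responses would then be fused with the dense connections and the FPN-style low-to-high path described above.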

Multi-Level Loss Function
Loss function design is one of the most important steps in deep learning, because the loss function guides the CNN to optimize its parameters during back-propagation (a loss function is normally written as the deviation between the prediction and the ground truth). As displayed in Figure 7, the conventional loss applied to semantic segmentation is the cross entropy loss (CEloss), which computes the mean of each pixel's loss. In the proposed model, three loss functions are integrated to supervise the training data at different levels and from various perspectives. The conventional CEloss is mainly employed to supervise the model at the pixel level. The boundary aware loss (BAloss) supervises the classification of object boundaries at the region level. The last, semantic encoding loss (SEloss), is designed to supervise classification at the image level. To help all three loss functions work better in the proposed model, the corresponding three convolution modules are designed as follows.

Cross Entropy Loss for Pixel-level Classification
As shown in the CE part of Figure 7, this module accepts the outputs of the FEFP module (X_3) and the BA module. The output of the FEFP module represents high-level semantic information for pixel classification, while the output of the BA module represents high-resolution information for boundary classification. The CE module also uses the result of the SE module as an auxiliary cue for judging which categories are present in the image, since the SE module can identify the categories from a global perspective.

Boundary Aware Loss for Region-level Classification
In segmentation tasks, the prediction of the boundary is often ignored, because object boundaries only account for a small part of the images and boundaries are often clear when the camera is stable [56]. But for RSIs, the camera is mobile and images are photographed from an extremely long distance, so the boundaries are distorted and their proportion can be much larger [57]. To reduce the impact of uncertain border definitions on the evaluation, the benchmark organizers provide a reference set without boundaries. Nevertheless, it is still desirable to improve the boundary quality of the segmentation result itself. Referring to conventional edge detection networks [58][59][60][61], high-resolution features are always adopted to obtain the boundary. However, high-resolution features lack semantic information, resulting in misclassification. Moreover, conventional boundary labels are too thin, which increases the training difficulty.
To overcome these drawbacks, we design the boundary aware module displayed on the left of Figure 7. It combines the outputs of the stem (X_1) and the first branch (X_2): the former possesses high resolution and structural information, the latter high resolution and high-level semantic information. Both features are fused to generate binary classification results. Then, boundary labels are used to supervise the results, urging the model to learn a clear boundary. The boundary labels are designed following the official requirements, using a circular disc of 3 pixel radius to erode the boundary and divide the images into two regions.
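Under these settings, a boundary label of this kind can be generated roughly as follows (a pure-NumPy sketch with a hypothetical helper name; the official tooling may differ in detail):

```python
import numpy as np

def boundary_label(label, radius=3):
    """Binary boundary mask from a semantic label map (H, W of class ids).

    A pixel is a raw boundary pixel if any 4-neighbour has a different class;
    the mask is then thickened with a circular disc of `radius` pixels,
    mirroring the disc-based boundary treatment described above.
    """
    h, w = label.shape
    edge = np.zeros((h, w), dtype=bool)
    # Compare with right and down neighbours; mark both sides of each change.
    diff_x = label[:, 1:] != label[:, :-1]
    edge[:, 1:] |= diff_x
    edge[:, :-1] |= diff_x
    diff_y = label[1:, :] != label[:-1, :]
    edge[1:, :] |= diff_y
    edge[:-1, :] |= diff_y
    # Dilate with a disc: OR together shifted copies within the radius.
    thick = np.zeros_like(edge)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy * dy + dx * dx > radius * radius:
                continue
            src = edge[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
            thick[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] |= src
    return thick

lbl = np.zeros((10, 10), dtype=int)
lbl[:, 5:] = 1                   # vertical class change at column 5
mask = boundary_label(lbl)
print(mask[0, 4], mask[0, 0])    # True False
```

The resulting mask splits each image into boundary and non-boundary regions, which is exactly the binary target the BA module is trained against.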

Semantic Encoding Loss for Image-Level Classification
In the standard training process of semantic segmentation, the network learns from isolated pixels (per-pixel cross-entropy loss for the given input image and ground truth label), so it may have difficulty understanding context without global information [22]. Some non-existing categories may be wrongly predicted for lack of global information. Therefore, the SE module shown on the right of Figure 7 is designed to make a global judgement. SEloss predicts the categories present in the input at a very small extra computation cost. Unlike per-pixel losses, SEloss considers large and small objects equally, which helps to improve the segmentation accuracy of small objects. In addition, the output of the semantic encoding module is a one-dimensional vector representing the existing categories, which guides the semantic segmentation results and filters out misclassified categories.

Multiple Loss Functions Fusion
The details of the proposed multi-level loss function are introduced step by step as follows. The CEloss function, the most commonly used loss function in semantic segmentation, is defined at each pixel (x, y) as:

L_CE = − Σ_{i=1}^{N} η_i(x, y) log(a_i)

where η_i(x, y) ∈ {0, 1} is the label of pixel (x, y) belonging to category i, N is the number of categories, and a_i is the predicted probability of belonging to category i ∈ {1, …, N} at pixel (x, y); the loss is averaged over all pixels.
The BAloss function is commonly used in binary semantic segmentation; the loss is defined as:

L_BA = − [ η(x, y) log (e^{a_0} / (e^{a_0} + e^{a_1})) + (1 − η(x, y)) log (e^{a_1} / (e^{a_0} + e^{a_1})) ] (5)

where η(x, y) ∈ {0, 1} is the label of pixel (x, y), a_0 is the score for boundary pixels, and a_1 the score for non-boundary pixels. The SEloss function consists of two parts: the first part turns the label map into a one-dimensional category vector, and the second part adopts multi-class binary cross entropy to calculate the loss:

L_SE = − Σ_{i=1}^{N} [ η(i) log(a_i) + (1 − η(i)) log(1 − a_i) ]

where η(i) ∈ {0, 1} is the i-th entry of the category vector and a_i is the predicted probability that category i is present.
To obtain high-quality regional segmentation and a clear boundary, we define L_all as a hybrid loss:

L_all = λ_1 L_CE + λ_2 L_BA + λ_3 L_SE

where λ_1, λ_2, and λ_3 balance the losses during training and are set to 1, 0.9, and 0.2, respectively. To illustrate the training process, Figure 8 plots (left) epoch vs. loss and (right) epoch vs. accuracy, where OA represents overall accuracy and F1 the F1 score. Both metrics are defined in Section 4.
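The three losses and their weighted fusion can be sketched in NumPy (a simplified single-image version with hypothetical helper names; the real model computes them on batched network outputs):

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ce_loss(logits, label):
    """Pixel-level cross entropy. logits: (N, H, W); label: (H, W) class ids."""
    p = softmax(logits, axis=0)
    h, w = label.shape
    return -np.mean(np.log(p[label, np.arange(h)[:, None], np.arange(w)]))

def ba_loss(logits, edge):
    """Region-level boundary loss: binary CE on a 2-channel boundary head."""
    p = softmax(logits, axis=0)
    return -np.mean(edge * np.log(p[0]) + (1 - edge) * np.log(p[1]))

def se_loss(category_logits, label, n_classes):
    """Image-level loss: multi-label BCE against the set of present classes."""
    present = np.zeros(n_classes)
    present[np.unique(label)] = 1.0
    a = 1.0 / (1.0 + np.exp(-category_logits))   # sigmoid
    return -np.mean(present * np.log(a) + (1 - present) * np.log(1 - a))

def multi_level_loss(seg_logits, edge_logits, cat_logits, label, edge,
                     lambdas=(1.0, 0.9, 0.2)):
    l1, l2, l3 = lambdas
    return (l1 * ce_loss(seg_logits, label)
            + l2 * ba_loss(edge_logits, edge)
            + l3 * se_loss(cat_logits, label, seg_logits.shape[0]))

rng = np.random.default_rng(0)
label = rng.integers(0, 6, (8, 8))
edge = rng.integers(0, 2, (8, 8)).astype(float)
loss = multi_level_loss(rng.standard_normal((6, 8, 8)),
                        rng.standard_normal((2, 8, 8)),
                        rng.standard_normal(6), label, edge)
print(loss > 0)  # True
```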

Experiment
In this section, the datasets and experimental settings are introduced first and then two sets of experiments applied on Potsdam and Vaihingen datasets will be analysed.

Datasets
The proposed HRCNet is evaluated on the ISPRS 2D semantic labeling benchmark, which includes the Potsdam and Vaihingen datasets. Both datasets can be freely downloaded (http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html). Six categories are manually labelled by six colors in these datasets: impervious surfaces (white), building (blue), low vegetation (cyan), tree (green), car (yellow), and clutter/background (red). The details of both datasets are listed in Table 1.

The Potsdam Dataset
The Potsdam dataset is a high-resolution airborne image dataset collected over the city of Potsdam. It contains 38 patches (6000 × 6000 pixels), each consisting of a true orthophoto (TOP) and a digital surface model (DSM) with a ground sampling distance of 5 cm. The TOP contains different band information: near infrared (NIR), red (R), green (G), and blue (B). In this study, the RGB bands are employed as features, with no DSM information. The dataset is divided into a training set and a testing set to avoid overfitting (a phenomenon in which the model depends heavily on features of the training set but performs poorly on the testing set), with 24 patches for training and the remaining 14 for testing. Due to hardware limits, we cut the samples into slices of 384 × 384 pixels. To avoid the impact of the cutting operation, overlaps of 72 and 192 pixels are adopted in the training and testing datasets, respectively. Finally, we get 8664 slices for training and 13,454 slices for testing.

The Vaihingen Dataset
The Vaihingen dataset shows a relatively small village with many detached buildings, with the same object categories as the Potsdam dataset. It contains 33 patches with sizes ranging from 1996 × 1995 to 3816 × 2550 pixels. The ground sampling distance of the TOP (containing IR (infrared), R, and G bands) and the DSM is 9 cm. In this study, the TOP patches are used, with 16 patches for training and 17 patches for testing. The slice size and overlapping pixels are the same as for the Potsdam dataset. Finally, 817 and 2219 slices are used for training and testing, respectively.

Experiment Settings and Evaluation Metrics
The experiments (see Table 2) are run on a high performance computing (HPC) resource with an NVIDIA RTX2080Ti GPU (11 GB RAM) using the PyTorch [62] deep learning framework; the IDE and language are PyCharm and Python, respectively. The commonly used stochastic gradient descent (SGD) optimizer is adopted. The learning rate (LR) and batch size (BS) are obtained through experiments. The loss functions are the aforementioned CEloss, BAloss and SEloss. The poly learning rate policy is used to make the training process smooth, expressed as follows:

lr = initial_learning_rate × (1 − iteration / max_iteration)^power

The initial_learning_rate is set to 0.01 and 0.08 on the Potsdam and Vaihingen datasets, respectively. The iteration count is calculated from the current epoch; max_iteration is the number of training epochs multiplied by the number of training images; power is set to 0.9.
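The poly policy itself is a one-liner; a sketch under the schedule above:

```python
def poly_lr(initial_lr, iteration, max_iteration, power=0.9):
    """Poly learning-rate policy: lr decays smoothly from initial_lr to 0."""
    return initial_lr * (1 - iteration / max_iteration) ** power

# Potsdam setting from the paper: initial LR 0.01, power 0.9.
print(poly_lr(0.01, 0, 1000))    # full LR at the first iteration
print(poly_lr(0.01, 500, 1000))  # roughly halfway through training
```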
The evaluation metrics follow the official advice, including the overall accuracy (OA) score, F1 score, precision, recall, and the commonly used mean intersection over union (mIoU, the average of the six categories' IoU) in the semantic segmentation field. The formulas are shown below:

OA = (TP + TN) / (P + N)
precision = TP / (TP + FP), recall = TP / (TP + FN)
F1 = 2 × precision × recall / (precision + recall)
IoU = TP / (TP + FP + FN)

where P, N, TP, TN, FP, and FN represent the positive, negative, true positive, true negative, false positive, and false negative pixels in the prediction map, respectively.
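These metrics can all be computed from a confusion matrix; a NumPy sketch (hypothetical helper name, one row per ground-truth class, one column per predicted class):

```python
import numpy as np

def segmentation_metrics(pred, gt, n_classes):
    """OA, per-class precision/recall/F1 and mIoU from two label maps."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class i, actually other
    fn = cm.sum(axis=1) - tp          # class i pixels missed by the prediction
    oa = tp.sum() / cm.sum()
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    miou = (tp / np.maximum(tp + fp + fn, 1)).mean()
    return oa, precision, recall, f1, miou

pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
oa, p, r, f1, miou = segmentation_metrics(pred, gt, 2)
print(oa)  # 0.75
```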

Training Data Preparation
Training data preparation is of paramount importance before using the data. Due to the memory limits of the GPU hardware, it is not possible to feed the whole image to the model. Here the method introduced in [1] is applied for training data generation, where the original training images are cut into small pieces while retaining the objects' spatial structure. The images are cut into 384 × 384 slices with an overlap of 72 pixels to prevent the loss of spatial structure. Then, considering the importance of multi-scale information, we randomly resize the slices to different scales (0.5, 0.75, 1.0, 1.25, 1.5). When all these data augmentation methods are applied, the dataset is effectively expanded by 20 times. These methods greatly increase the diversity of the dataset, improving the applicability of the proposed model.
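The slicing step can be sketched as follows (hypothetical helper name). With 384 × 384 slices and a 72-pixel overlap, a 6000 × 6000 Potsdam patch yields 19 × 19 = 361 slices, and 24 × 361 = 8664 matches the training-set count reported above:

```python
import numpy as np

def cut_slices(image, size=384, overlap=72):
    """Cut an (H, W, C) image into size x size slices with a fixed overlap.

    The stride is size - overlap; the last row/column of positions is shifted
    back so every slice is full-size and the whole image is covered.
    """
    h, w = image.shape[:2]
    stride = size - overlap
    ys = list(range(0, max(h - size, 0) + 1, stride))
    xs = list(range(0, max(w - size, 0) + 1, stride))
    if ys[-1] != h - size:
        ys.append(h - size)
    if xs[-1] != w - size:
        xs.append(w - size)
    return [image[y:y + size, x:x + size] for y in ys for x in xs]

# A 6000 x 6000 patch with 384 x 384 slices and a 72-pixel overlap:
patch = np.zeros((6000, 6000, 3), dtype=np.uint8)
slices = cut_slices(patch)
print(len(slices), slices[0].shape)  # 361 (384, 384, 3)
```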

Testing Data Preparation
Testing data preparation prepares the testing data before the network performance is validated. Considering the hardware limitation, the testing data are also cut into slices. Since complete prediction maps are required, we introduce three methods to restore them: the conventional method, the 'sliding window' method, and our proposed method. The conventional method does not rebuild the edges of the slices, as it employs slices without (w/o) overlap, which leads to spatial structure degradation. The 'sliding window' method (w/sw) adopts slices with an overlap of 192 pixels and adds the overlapping parts of two slices to rebuild the edges. However, this operation may lead to wrong predictions, since the two slices are not guaranteed to generate the same result in the overlapping part. Our proposed method (w/ours) adopts the same slices as the 'sliding window' method, but for each slice only the middle 192-pixel square is kept, which mitigates the spatial structure degradation at the edges. As shown in Table 3, applying our proposed method yields nearly 0.6% improvement in each evaluation metric. Figure 9 shows the boundary improvement of the three methods. The red line in the input image represents the junction of the slices. As displayed in the black circle of the prediction map, our proposed method (w/ours) performs better at the junction. Consequently, the following experiments all adopt the proposed testing method.
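The center-keeping stitch can be illustrated in one dimension (a hypothetical sketch; border slices keep their full outer margin so the image stays covered):

```python
import numpy as np

def stitch_center_1d(preds, positions, length, size=384, keep=192):
    """1-D sketch of the proposed stitching: each slice contributes only its
    central `keep` pixels, except at the image borders, where the slice keeps
    its full outer margin so the whole row stays covered."""
    out = np.full(length, -1)
    margin = (size - keep) // 2
    for pos, pred in zip(positions, preds):
        lo = 0 if pos == 0 else margin
        hi = size if pos + size == length else size - margin
        out[pos + lo:pos + hi] = pred[lo:hi]
    return out

# Three 384-px slices with 192-px overlap covering a 768-px row.
preds = [np.full(384, i) for i in range(3)]
out = stitch_center_1d(preds, [0, 192, 384], 768)
print((out >= 0).all())  # True: every pixel is covered
```

Because each interior pixel is taken far from any slice edge, the prediction there has seen the most surrounding context, which is the intuition behind the ~0.6% gain reported in Table 3.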

Experimental Results
Using the same framework (PyTorch [62]) and settings, the following networks are compared: FCN [33], PSPNet [63], FPN [53], UNet [12], DeepLab_v3 [64], DANet [17] and the light-weight network BiseNet_v2 [41]. Meanwhile, the top-1 methods of the Potsdam and Vaihingen 2D Semantic Labeling challenges, SWJ_2 [65] and HUSTW5, are also compared, respectively. For ease of analysis, the best results in the tables below are highlighted in bold. Table 4 shows the quantitative results on the Potsdam dataset considering the boundary information (full reference set). DeepLab_v3 and PSPNet, with their multi-scale fusion modules, achieve good OA and F1 scores. SWJ_2 and DANet, which contain attention modules to obtain global context information, also perform well. BiseNet_v2 is less capable than the aforementioned four models, but has extremely low Giga Floating-point Operations Per Second (GFLOPS) and Params. Our proposed model HRCNet_W32 (with far fewer GFLOPS and Params) improves the Recall score by 0.9% and the F1 score by 0.19% compared with SWJ_2. In addition to HRCNet_W32, HRCNet_W48 (with Flip, a data augmentation method known as flip testing) also yields the best performance.
The results on the no-boundary reference set are compared in Table 5. The two proposed models (HRCNet_W32 and HRCNet_W48) surpass the SWJ_2 method on all evaluation metrics. Table 6 displays the IoU of each category. Taking DeepLab_v3 as the baseline, the proposed HRCNet (W48) achieves an average increase of 0.95%; notably, the categories building (+1.20%) and car (+1.16%) improve the most, which proves the effectiveness of the proposed modules (LDA etc.). The prediction maps (see Figure 10) show that our proposed models achieve better performance on the integrity of buildings and the details of small objects (see the black circles). Table 7 displays the results on the Vaihingen dataset. GFLOPS and Params are not reported since they are not available for HUSTW5. On the full reference set, HRCNet_W48 (with Flip and MS, where MS represents the data augmentation method called multi-scale testing) improves the Recall by 2.97%, the Precision by 0.27%, and the F1 by 1.57% compared with HUSTW5. The supplementary results on the no-boundary reference set are shown in Table 8; the proposed method again achieves the best performance over HUSTW5 (" †" represents Flip and MS [66]). Table 9 shows the IoU scores on the Vaihingen dataset. With UNet as the baseline, our model leads to the same conclusion as the experiments on the Potsdam dataset, achieving greater improvement on both a large object (building) and a small object (car).
The improvement on the Vaihingen dataset is much larger than on the Potsdam dataset. The likely reason lies in the image composition: the building category accounts for a large proportion of each image in the Vaihingen dataset. Therefore, it is particularly essential for the model to obtain spatial information and boundary details there. Figure 11 displays the ability of each model to extract spatial information: as seen at the building boundaries, our proposed models predict sharp boundaries and complete structures on the Vaihingen dataset.
Figure 10. Prediction maps of the compared methods on the Potsdam dataset. " †" means using data augmentation (Flip testing) methods.
Figure 11. The prediction maps of the above methods on the Vaihingen dataset. " †" means using data augmentation (Flip and MS testing) methods.

Discussion
In this section, two sets of ablation experiments are first used to verify the effectiveness of the proposed modules (LDA, FEFP, and the multi-level loss function). Then, the reasons for the accuracy improvements are analysed and the improvements of the segmentation results are visualised. Finally, the improvements over previous research and the limitations of the proposed methods are discussed.

Ablation Experiments
To demonstrate the effectiveness of the proposed modules, two ablation experiments are performed on the Potsdam dataset. In addition, to rule out the possibility that the improvement is merely caused by an increase in parameters and computation load, a comparative study is conducted before and after adding the different architectures. The number of parameters is abbreviated as Params (in millions, M), and the computation load is expressed in GFLOPs (giga floating-point operations). All of the following experiments use the same settings.
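As a rough reference for how these two quantities are obtained, the per-layer costs of a standard convolution can be sketched as follows (generic textbook formulas for illustration, not the exact profiling tool used in our experiments):

```python
# Standard per-layer cost formulas for a k x k convolution, as commonly used
# when reporting Params and GFLOPs; not tied to any specific profiling library.

def conv2d_params(c_in, c_out, k, bias=True):
    """Number of learnable parameters of a k x k convolution layer."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """Floating-point operations over the output map,
    counting each multiply-add as 2 FLOPs."""
    return 2 * k * k * c_in * c_out * h_out * w_out

# Example: a 3x3 convolution from 3 to 64 channels on a 512x512 output map.
p = conv2d_params(3, 64, 3)           # 1792 parameters
f = conv2d_flops(3, 64, 3, 512, 512)  # 905,969,664 FLOPs, i.e. about 0.91 GFLOPs
```

Summing these per-layer counts over the whole network gives the Params and GFLOPs figures reported in the tables.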
As shown in Table 10, the LSA, LCA, and FEFP modules are evaluated on HRNet_W32_S (where S denotes a light-weight HRNet_W32) and on HRNet_W48. On the HRNet_W32_S model, the LSA and LCA modules each yield an increase of approximately 0.1% in the F1 and OA scores, and integrating the two modules yields gains of 0.19% in OA and 0.22% in F1. The proposed HRCNet_W32, which integrates the LSA, LCA, and FEFP modules, achieves improvements of 0.40% in OA and 0.31% in F1. On the HRNet_W48 model, the LSA and LCA modules together yield 0.16% in OA and 0.11% in F1, both smaller than the corresponding gains on HRNet_W32_S. The overall improvements of the proposed architecture with W32 and W48 are 0.40%/0.28% in OA and 0.31%/0.30% in F1. A possible explanation is that the same modules find it harder to improve upon a stronger backbone (HRNet_W48). Nevertheless, it can be concluded that the proposed HRCNet is advantageous for both W32 and W48 compared with the single-module networks.
Table 11 reports the ablation of the three multi-level loss functions (CEloss, BAloss, and SEloss). Compared with the CEloss-based and BAloss-based settings, combining CEloss, BAloss, and SEloss outperforms both on the two models (W32: 89.08/89.77; W48: 89.26/89.98 in OA/F1). From these results, we conclude that BAloss brings a greater improvement than SEloss: the model can easily recognise which categories are present in an image (the purpose of SEloss), but distinguishing the boundaries remains challenging without BAloss. Table 12 shows that, even with fewer Params and GFLOPs, the proposed model achieves better OA and F1 scores.
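The way the three supervision signals combine into one multi-level loss can be sketched in pure Python as follows; the per-term formulations and the weights here are illustrative assumptions, not our exact implementation:

```python
import math

EPS = 1e-7  # clamp probabilities away from 0/1 before taking logs

def pixel_ce(probs, labels):
    """Mean pixel-wise cross-entropy (CEloss sketch).
    probs: per-pixel class-probability lists; labels: per-pixel class indices."""
    return -sum(math.log(max(p[y], EPS)) for p, y in zip(probs, labels)) / len(labels)

def class_presence_bce(probs, labels, n_classes):
    """SEloss sketch: image-level 'is class c present?' binary cross-entropy,
    using the per-class maximum over pixels as the predicted presence score."""
    target = [1.0 if c in labels else 0.0 for c in range(n_classes)]
    pred = [min(max(max(p[c] for p in probs), EPS), 1 - EPS) for c in range(n_classes)]
    return -sum(t * math.log(q) + (1 - t) * math.log(1 - q)
                for t, q in zip(target, pred)) / n_classes

def multi_level_loss(probs, labels, boundary_idx, n_classes,
                     w_ce=1.0, w_ba=1.0, w_se=0.1):
    """Weighted sum of CEloss, BAloss (here: CE restricted to boundary pixels),
    and SEloss. The weights are illustrative, not tuned values."""
    ce = pixel_ce(probs, labels)
    ba = pixel_ce([probs[i] for i in boundary_idx],
                  [labels[i] for i in boundary_idx]) if boundary_idx else 0.0
    se = class_presence_bce(probs, labels, n_classes)
    return w_ce * ce + w_ba * ba + w_se * se
```

Restricting the boundary term to boundary pixels is what concentrates the extra supervision where, as the ablation shows, CEloss alone struggles.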
To directly verify the model's ability to combine both the high-resolution and the low-resolution information of the image, the segmentation results of HRNet_W32_S and HRCNet_W32 are visualised in Figure 12. Categories such as building and tree reflect global perception, while car reflects attention to detail.

Improvements and Limitations
Our proposed model focuses on obtaining global context information (LDA module), spatial information (HRNet structure), and boundary information (BA module) at the same time. Previous remote sensing segmentation networks, such as UNet (spatial information) and DeepLab_V3 (global context information), focus on only one of these aspects, and very few works highlight boundary information. Moreover, we are the first to propose combining the three loss functions (CEloss, BAloss, and SEloss) with the proposed modules to improve all three of these vital factors. The prediction maps in Figures 9-12 visualise the improvements in boundary details and overall segmentation performance when our methods are adopted. In particular, Tables 6 and 9 show the increases in IoU scores on the Potsdam and Vaihingen datasets: the IoU scores of the building and car categories are above the mean IoU (mIoU), which means our models markedly improve the segmentation of both large and small objects thanks to the proposed modules.
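Notably, the boundary supervision that BAloss relies on can be derived from the ground-truth label map alone. A minimal sketch, assuming a simple 4-neighbourhood rule (the actual boundary extraction in our pipeline may differ):

```python
def boundary_map(labels):
    """Mark every pixel whose 4-neighbourhood contains a different class label.
    labels: 2-D list of class indices; returns a same-sized 0/1 mask."""
    h, w = len(labels), len(labels[0])
    mask = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w and labels[ni][nj] != labels[i][j]:
                    mask[i][j] = 1
                    break
    return mask

# A vertical class border produces a two-pixel-wide boundary band:
grid = [[0, 0, 1],
        [0, 0, 1],
        [0, 0, 1]]
# boundary_map(grid) -> [[0, 1, 1], [0, 1, 1], [0, 1, 1]]
```

Because such a mask is obtained for free from the existing annotations, boundary supervision adds no labelling cost.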
There are also some limitations of our models and experimental results. The most important one concerns the choice of band information. In this paper, we choose the RGB bands for the Potsdam dataset and the IRRG bands for the Vaihingen dataset, each composed of three separate bands. Reference [7] showed that different band combinations lead to different results and proposed several methods to evaluate the combination modes. Moreover, more bands clearly carry more information, so the combination mode and the number of bands deserve special consideration. We also tried adding the DSM band to the RGB bands but obtained only marginal gains; we attribute this to the distinct characteristics of the DSM band, for which the present models would need to be specially adapted.

Conclusions
In this paper, CNN-based semantic segmentation of remote sensing images is studied. Motivated by the importance of spatial information, global context information, and boundary details, a novel architecture named the High-Resolution Context Extraction Network (HRCNet) is proposed. Whereas HRNet only weakly captures global context and boundary information, the proposed method introduces dedicated modules to overcome these shortcomings. The LDA module adopts light-weight dual attention mechanisms to make the model focus on the relevance between different categories. The FEFP module exploits multi-scale contextual information, achieving high accuracy (W48 reaches 89.13% OA and 89.74% F1 on the full reference set) with a low computation load (W32 consumes only 11.1 GFLOPs and 9.1 M Params) compared with HRNet. Finally, the boundary aware (BA) module greatly improves object boundaries (see Figure 12), and the multi-level loss function is applied to optimise the model. The proposed architecture outperforms existing state-of-the-art networks, achieving the best overall accuracy scores of 92.0% and 92.3% on the Potsdam and Vaihingen datasets, respectively. With the increasing physical information available in remote sensing images, DSM information could be considered in the future to further improve the network performance.
Author Contributions: W.Z. and J.L. conceived of the idea; Z.X. verified the idea and designed the study; T.Z. and J.L. analyzed the experimental results; Z.X. wrote the paper; T.Z. and W.Z. gave comments and suggestions to the manuscript. All authors read and approved the submitted manuscript.