Semantic segmentation for remote sensing images based on an AD-HRNet model

ABSTRACT Semantic segmentation for remote sensing images faces challenges of unbalanced category weights, rich context that complicates recognition, blurred boundaries of multi-scale objects, and so on. To address these problems, we propose a new model that combines HRNet with attention mechanisms and dilated convolution, denoted AD-HRNet, for the semantic segmentation of remote sensing images. In the AD-HRNet framework, we obtained the weight of each category through an improved weighted cross-entropy function that introduces the median frequency balance method to solve the issue of class weight imbalance. The Shuffle-CBAM module, with channel attention and spatial attention, was applied to extract more global context information from images while only slightly increasing the amount of computation. To address the blurred boundaries caused by multi-scale object segmentation and edge segmentation, we developed an MDC-DUC module in the AD-HRNet framework to capture the context information of multi-scale objects and the edge information of many irregular objects. Taking the Potsdam, Vaihingen, and SAMA-VTOL datasets as materials, we verified the performance of AD-HRNet by comparing it with eight typical semantic segmentation models. Experimental results showed that AD-HRNet increases the mIoUs to 75.59% and 71.58% on the Potsdam and Vaihingen datasets, respectively.


Introduction
Aerial images collected from satellites and aviation platforms are an important basis for realizing urban scene perception, resource management, natural disaster detection, ecological environment investigation (Li, Yao, and Zhenfeng 2015; Zhang et al. 2020), and so on.
Semantic segmentation, as the key step in realizing automatic analysis and utilization of remote sensing images in the above applications, takes raw image data as input and converts them into a mask with highlighted regions of interest. Traditional methodologies for semantic segmentation of remote sensing images mainly applied manual inspection or machine learning algorithms, such as SIFT (Ke and Sukthankar 2004), HOG (Wang, Han, and Yan 2009), and N-cut (Shi and Malik 2000), to extract relevant features from the image data for subsequent processing and analysis. Unfortunately, these approaches are time-consuming, laborious, costly, and unable to achieve good generalization (Zhu et al. 2017; Ching et al. 2018). With the development of artificial intelligence and computing performance, deep-learning-based models have been widely applied in various fields related to computer vision, such as biomedicine (Ching et al. 2018), satellite remote sensing (Zhu et al. 2017), city perception (Dubey et al. 2016), and automatic driving (Grigorescu et al. 2020). Semantic segmentation, as a subfield of computer vision, has also made significant breakthroughs (Long, Shelhamer, and Darrell 2015; Ronneberger, Fischer, and Brox 2015; Sun, Zhao, et al. 2019). However, challenges remain for semantic segmentation of high-resolution remote sensing images with current technologies. First, remote sensing images have an extremely imbalanced category weight distribution (Kampffmeyer, Salberg, and Jenssen 2016; Leichtle et al. 2017), which causes the segmentation accuracy of low-weight classes to fall far below that of high-weight classes. As shown in Figure 1, the proportions of the different target categories in the dataset are unbalanced; for example, the proportion of cars is much lower than that of the other categories.
Secondly, remote sensing images generally have high resolution and rich context information, but many current semantic segmentation models fail to utilize all the available global context information. Thirdly, objects in remote sensing images often vary greatly in scale, and their boundaries are blurry, which is not conducive to segmentation (Deng et al. 2018). As we can see from Figure 2, the objects in the red and yellow boxes are very hard to identify. Therefore, effectively solving the above problems is key to improving the accuracy of remote sensing semantic segmentation.
To address these issues, we propose a targeted semantic segmentation model that combines HRNet with attention mechanisms and dilated convolution (denoted AD-HRNet). The proposed AD-HRNet uses an improved weighted cross-entropy loss function that introduces the median frequency balance method to obtain the weight of each category, thus reducing the impact of class imbalance. Then, based on the performance of CBAM, we designed a Shuffle-CBAM module that adds an attention mechanism to AD-HRNet, aiming to decrease computational expense and increase the use of global context information during semantic segmentation. We also constructed a mixed dilated convolution + dense upsampling convolution (MDC-DUC) module in the AD-HRNet framework to improve recognition accuracy by extending the receptive field. By applying a learnable upsampling method in the AD-HRNet framework, missing details such as segmentation boundaries and object edges are recovered. We compared our model with the original version of HRNet and other typical deep-learning-based methods on the Potsdam, Vaihingen, and SAMA-VTOL datasets. The results showed that the mIoU of semantic segmentation using the proposed AD-HRNet reached 75.59% and 71.58% on the first two datasets, an improvement of 0.74% and 0.73%, respectively, over the original version of HRNet. In addition, ablation experiments on the Potsdam and Vaihingen datasets and a generalization test on SAMA-VTOL also verified the effectiveness of the designed modules. In summary, the contributions of this study include: (1) We proposed an improved HRNet model, AD-HRNet, which deals with the relevant problems of aerial remote sensing semantic segmentation through designed modules, including a preprocessing module, the Shuffle-CBAM module, and the MDC-DUC module.
Specifically, the improved weighted cross-entropy loss function module improves the segmentation accuracy of small targets; Shuffle-CBAM module effectively uses the global context information of images and improves the segmentation accuracy with less computational expense; MDC-DUC module solves the problem of low precision of edge segmentation of multi-scale objects and irregular objects.
(2) The effectiveness of the AD-HRNet model has been demonstrated by extensive tests. The experimental results illustrate that the proposed AD-HRNet performs better than eight other typical semantic segmentation models (e.g. U-Net, FCN, CCNet, DeeplabV3, DANet, OCRNet, and UNetFormer) on three benchmark datasets: Potsdam, Vaihingen, and SAMA-VTOL.

General semantic segmentation
Semantic segmentation is a typical computer vision problem that takes some original data (e.g. plane images, remote sensing images, and panoramic images) as input and converts them into masks with highlighted regions of interest. As a basic task in the field of computer vision, semantic segmentation has a significant impact on the development of urban scene analysis, remote sensing image processing (Diakogiannis et al. 2020), automatic driving (Treml et al. 2016), and other fields. Early semantic segmentation was mainly conducted with traditional algorithms such as grey-scale segmentation (Panda and Rosenfeld 1978) and conditional random fields (Plath, Toussaint, and Nakajima 2009). Since FCN (Fully Convolutional Networks) was used for semantic segmentation of images (Long, Shelhamer, and Darrell 2015), deep neural networks have gradually become mainstream in the image semantic segmentation field because of their performance. Existing approaches to deep-learning-based semantic segmentation mainly follow two modes. The first is based on the encoder-decoder structure, e.g. U-Net (Ronneberger, Fischer, and Brox 2015), SegNet (Badrinarayanan, Kendall, and Cipolla 2017), and the DeepLab series. The other maintains the high-resolution representation through high-resolution convolution, e.g. HRNet (Sun, Zhao, et al. 2019). In addition, using dilated convolution and deformable convolution (Dai et al. 2017) to replace conventional convolution is also a common way to adjust the receptive field of semantic segmentation models. Apart from that, the development of Transformer-based algorithms in computer vision in recent years (Dosovitskiy et al. 2020) has also promoted the application of semantic segmentation models with attention mechanisms.
However, for complex scenes, making full use of the context information of images and accurately segmenting multi-scale objects with irregular shapes and fuzzy boundaries remain difficult tasks. A complex external environment (e.g. sunshine, clouds, and occlusion) also increases the difficulty of semantic segmentation.

Semantic segmentation of remote sensing imagery
At present, remote sensing images are widely used in many fields of society, such as natural disaster monitoring (Dong and Shan 2013; Joyce et al. 2009), land resource management (Alqurashi and Kumar 2013), and ecological environment investigation (Kerr and Ostrovsky 2003; Song et al. 2020). The basis of these studies and applications is to segment remote sensing images accurately and efficiently according to target categories. Therefore, semantic segmentation technology has received much attention in the field of remote sensing. For example, Kemker, Salvaggio, and Kanan (2018) applied a DCNN fully convolutional network to the semantic segmentation of multispectral remote sensing images and obtained good results on their RIT-18 dataset. Diakogiannis et al. (2020) developed a new semantic segmentation architecture (named ResUNet-a) and a novel loss function based on Dice loss to get reliable and efficient segmentation results from high-resolution aerial remote sensing images. Seong and Choi (2021) constructed a semantic segmentation model combined with a multi-attention mechanism (denoted CSAG-HRNet) to enhance the extraction of buildings. Du et al. (2021) illustrated a novel semantic segmentation method for remote sensing by combining Deeplabv3+ and object-based image analysis (OBIA). Liu et al. (2022) built a novel residual ASPP block with attention, named RAANet, to acquire more scale information by introducing an attention mechanism. Its disadvantage is the increased amount of computation; at the same time, it cannot solve the problems of class imbalance and low accuracy of object boundary recognition. With the development of the Transformer in computer vision, Transformer-based semantic segmentation methods have emerged. For instance, Meng et al. (2022) proposed a class-guided Swin Transformer method for semantic segmentation of remote sensing images, which enhanced the use of ViT for obtaining global information. However, in terms of local information acquisition, Transformer-based models are weaker than CNN-based models, so the class-guided Swin Transformer cannot solve the problem of low accuracy of object boundary recognition. Wang et al. (2022) established UNetFormer, based on the encoder-decoder architecture, for segmenting urban scenes in real time. They also developed a global-local attention mechanism to obtain the global and local information of images. Both UNetFormer and our proposed model account for the importance of obtaining global and local image information, but our model is better at resolving object boundary ambiguities and class imbalance than the encoder-decoder architecture used by UNetFormer.

Attention mechanism
The attention mechanism is important for capturing the global information of images. There are three kinds of attention mechanisms in the field of computer vision: spatial attention, channel attention, and spatial-channel mixed attention. The spatial attention mechanism is an adaptive spatial region selection mechanism whose purpose is to obtain the context dependence between pixels. For example, the ViT (Dosovitskiy et al. 2020) and Non-Local models applied a spatial attention mechanism to achieve target detection, semantic segmentation, and other basic tasks. In a deep neural network, different channels of different feature maps generally represent different objects. The channel attention mechanism evaluates the importance of each feature channel and then enhances or suppresses different channels for different categories to determine the attention position (Guo et al. 2022). At present, the representative neural network model using channel attention is SENet (Hu, Shen, and Sun 2018). Its basic idea is to add a bypass branch after the normal convolution operation, sequentially perform a squeeze operation and an excitation operation to obtain the weight of each feature channel, and then apply the weights to the original feature channels to learn the importance of different channels. The spatial-channel mixed attention mechanism performs better than the other two kinds, but it inevitably increases memory occupation and computational expense (Zhang and Yang 2021). So, a trade-off between accuracy, time, and video memory should be considered when using the double attention mechanism. Beyond that, well-known models based on the dual attention mechanism include CBAM, BAM, and DANet (Fu et al. 2019). The attention mechanisms used in these models can obtain context information and produce better results during semantic segmentation of images.
However, how to construct a suitable attention module for remote sensing images is still a hot topic in the field of semantic segmentation because of their characteristic large size and rich information.
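As a concrete illustration of the squeeze-and-excitation idea described above, the following is a minimal PyTorch sketch; the reduction ratio and layer sizes are our own assumptions, not SENet's published settings.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation bypass branch: global average
    pooling (squeeze), a bottleneck MLP with sigmoid (excitation),
    then reweighting of the original feature channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze + excitation -> (B, C)
        return x * w[:, :, None, None]    # reweight each channel
```

The bypass branch leaves the feature map's size unchanged, which is why such blocks can be inserted after almost any convolution.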

Methodology
By analysing the limitations of present models for semantic segmentation of remote sensing images, we constructed a novel framework (named AD-HRNet) that combines HRNet with attention mechanisms and dilated convolution, as shown in Figure 3. In our model, we improved the original version of HRNet by introducing the designed Shuffle-CBAM attention module and the Mixed Dilated Convolution + Dense Upsampling Convolution (MDC-DUC) module whilst maintaining its traditional advantages. Specifically, we used the improved weighted cross-entropy function to replace the original cross-entropy function in the HRNet architecture to solve the problem of class imbalance during semantic segmentation. We added an attention mechanism to each module of HRNet and fused the information of feature maps in the channel and spatial dimensions without incurring too much computation. It should be noted that we designed a Shuffle-CBAM module and added it to the end of each basic block of the original version of HRNet, as shown in Figure 3. In addition, we added a new MDC-DUC module to the proposed AD-HRNet, which increases the diversity of receptive fields through mixed convolution blocks and solves the 'gridding' problem caused by conventional dilated convolution. In the AD-HRNet framework, we used learnable upsampling convolution to replace the traditional bilinear interpolation method in the HRNet architecture to capture and recover missing details. The final semantic segmentation feature map is obtained by layer-by-layer fusion.

Brief introduction for HRNet structure
Most existing semantic segmentation models based on the encoder-decoder architecture use high-to-low-resolution encoders connected in series to extract feature maps and then restore the high-resolution representation through corresponding decoders. The disadvantage of this design is that, after multiple convolutions for feature extraction, the encoder is likely to produce a blurry feature map in which some edge details and small target objects are lost. The HRNet model was first proposed for human posture estimation. It uses high-resolution convolution to maintain a high-resolution representation in one branch throughout, and fuses the feature maps of the parallel lower-resolution convolutions to enhance the representation. In the original HRNet (Sun, Zhao, et al. 2019) network, the input image is downsampled to a quarter of the original size by two 3 × 3 convolutions. Then, two feature maps with different sizes are output by one bottleneck block.
Next, three groups of basic blocks with different channel numbers are connected. The structural designs of the bottleneck block and basic block follow the principle of ResNet (He et al. 2016). Finally, four feature maps with different sizes and dimensions are reshaped into the size of the high-resolution subnet by bilinear interpolation, and the output is determined by a 1 × 1 convolution layer.

Improved modules of AD-HRNet
Taking HRNet as the basic architecture, we proposed a novel model AD-HRNet by using the following modules.

The improved weighted cross-entropy loss
Considering the limitations of the loss function used in the HRNet model, e.g. its inability to address the class imbalance problem, we applied an improved weighted cross-entropy loss function during semantic segmentation. It should be noted that the weighted cross-entropy loss was first proposed by Khoshboresh-Masouleh, Alidoost, and Arefi (2020) to solve the class imbalance problem. In this study, we modified it by adding the median frequency balance method (Badrinarayanan, Kendall, and Cipolla 2017) to obtain the weight of each category, further weakening the impact of class imbalance and enhancing the segmentation accuracy of targets that occupy a small proportion of the images. The conventional cross-entropy loss function in the framework of AD-HRNet can be calculated based on Equation (1).
$$ L_{ce} = -\frac{1}{N}\sum_{j=1}^{N} y_j \log(p_j) \quad (1) $$

where $y$ is the real category label, $p$ is the predicted probability of the corresponding category, and $N$ is the batch size.
To alleviate the problem of class imbalance during semantic segmentation, we calculate the corresponding weight of each category with the following formulas.

$$ w_{c_i} = \frac{m}{p_i} \quad (2) $$

$$ p_i = \frac{n_i}{n_{total}} \quad (3) $$

where $w_{c_i}$ is the weight of category $i$, $p_i$ is the proportion of each category in the images, $i = 1, 2, \ldots, n$, and the parameter $m$ is the median of the set $P = [p_1, p_2, \ldots, p_n]$ composed of the $p_i$. The parameter $n_i$ is the number of images containing the corresponding category and $n_{total}$ is the total number of images over all categories.
We add the computed category weights into the standard cross-entropy loss function and obtain the final weighted cross-entropy loss, as shown in Equation (4).

$$ L_{wce} = -\frac{1}{N}\sum_{i=1}^{N} w_{c_i}\, y_i \log(\hat{y}_i) \quad (4) $$

Equation (4) is the final loss function of the proposed AD-HRNet. The parameter $w_{c_i}$ is drawn from the set of weights over all categories. The variables $y_i$ and $\hat{y}_i$ are the ground truth and the predicted result, respectively.
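The weighting scheme of Equations (2)-(4) can be sketched in plain NumPy; the pixel counts below are hypothetical and only illustrate how the median frequency balance boosts rare classes such as cars.

```python
import numpy as np

def median_frequency_weights(pixel_counts):
    """Equations (2)-(3): w_ci = m / p_i, where p_i = n_i / n_total
    and m is the median of all class frequencies p_i."""
    counts = np.asarray(pixel_counts, dtype=np.float64)
    p = counts / counts.sum()          # per-class frequency p_i
    m = np.median(p)                   # median frequency m
    return m / p                       # rare classes get weights > 1

def weighted_cross_entropy(probs, labels, weights):
    """Equation (4): batch mean of -w_c * log p_c, where c is the
    ground-truth class of each sample."""
    probs = np.asarray(probs, dtype=np.float64)
    w = weights[labels]                # per-sample class weight
    return float(np.mean(-w * np.log(probs[np.arange(len(labels)), labels])))

# Toy example with hypothetical pixel counts: the last class is
# severely under-represented, so it receives the largest weight.
w = median_frequency_weights([500, 400, 80, 20])
```

With these counts the frequencies are [0.5, 0.4, 0.08, 0.02] and the median is 0.24, so the rarest class gets weight 12 while the dominant one gets 0.48.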

Shuffle-CBAM
As illustrated in the literature review, the attention mechanisms currently applied in computer vision have some limitations, such as the increased computational expense of the spatial-channel mixed attention mechanism (Zhang and Yang 2021). By analysing the features of CBAM, we developed a more lightweight attention module, named Shuffle-CBAM. The structure of Shuffle-CBAM is shown in Figure 4.
As shown in Figure 4, the main idea of Shuffle-CBAM is first to divide the input feature maps into several groups according to the number of channels. The feature maps of each group are then further split into two sub-groups along the channel dimension, and the spatial attention and channel attention mechanisms are applied to the respective sub-feature maps. In the final step of Shuffle-CBAM, all sub-features are aggregated, and information is exchanged between them by a channel shuffle mechanism. Details of the two attention modules used in the Shuffle-CBAM architecture, the spatial attention block and the channel attention block, are shown in Figure 5.
As we can see from Figure 5, given a feature map $X \in \mathbb{R}^{C\times H\times W}$, where C, H, and W represent its channel number, height, and width, respectively, X is first grouped along the channel dimension. We set the group number G as a hyperparameter, ensuring that the channel number C is divisible by G. After grouping, the feature map X is divided into G groups $X = [X_0, X_1, \ldots, X_{G-1}]$ with $X_i \in \mathbb{R}^{C/G\times H\times W}$, and each $X_i$ is further split into two halves along the channel dimension, denoted as $X_{i1}$ and $X_{i2}$. In the spatial attention block, parallel average pooling and maximum pooling over the channels of $X_{i1}$ yield two 2D feature maps $X_{i1,avg}^{S} \in \mathbb{R}^{1\times H\times W}$ and $X_{i1,max}^{S} \in \mathbb{R}^{1\times H\times W}$. The generated feature maps are concatenated into an effective feature representation, and a standard 7 × 7 convolution is then used to generate the spatial attention map. Meanwhile, in the channel attention block, two spatial context feature vectors $X_{i2,avg}^{C}$ and $X_{i2,max}^{C}$ are generated by applying average pooling and maximum pooling to $X_{i2}$. These vectors are sent to a shared network composed of a multilayer perceptron to generate a channel attention map $M_C \in \mathbb{R}^{C\times 1\times 1}$. Finally, the output feature vectors are fused by element-wise summation.

Figure 4. The structure diagram of the Shuffle-CBAM module. The grouping strategy reduces the number of feature maps processed by the spatial and channel attention blocks, cutting the amount of computation required, and a 'channel shuffle' then exchanges information between feature maps to obtain global context information.

Figure 5. Detailed design of the spatial and channel attention blocks. The attention design is largely similar to CBAM, with certain tweaks in the details to make the Shuffle-CBAM module more efficient.
The formulas of spatial attention and channel attention are shown as follows.

$$ M_S = f^{7\times 7}(\mathrm{Cat}[\mathrm{AvgPool}(X_{i1}), \mathrm{MaxPool}(X_{i1})]) \quad (5) $$

where $f^{7\times 7}$ denotes a standard 7 × 7 convolution operation, Cat is the operation of concatenating tensors along the channel dimension, AvgPool executes an average pooling operation, MaxPool executes a maximum pooling operation, $X_{i1}$ is the grouped feature map, and $M_S \in \mathbb{R}^{H\times W}$ is the spatial attention map of the final output. We used a standard 7 × 7 convolution here because it gives the best results.
$$ M_C = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(X_{i2})) + \mathrm{MLP}(\mathrm{MaxPool}(X_{i2}))) \quad (6) $$

where $\sigma$ denotes the activation function, MLP executes the multilayer perceptron operation, $X_{i2}$ is the grouped feature map, and $M_C \in \mathbb{R}^{C\times 1\times 1}$ is the channel attention map of the final output. The other symbols in Equation (6) are the same as in Equation (5).
It should be noted that we aggregate all sub-features in a similar way to ShuffleNet (Ma et al. 2018): a 'channel shuffle' enables information exchange between different groups. The input and output feature maps of the Shuffle-CBAM module are kept the same size to maintain the module's good scalability.
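The grouping, dual-attention, and channel-shuffle steps above can be sketched as a PyTorch module. This is a simplified illustration under our own assumptions (group count, MLP width, and pooling details), not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ShuffleCBAM(nn.Module):
    """Sketch of the Shuffle-CBAM idea: split channels into G groups,
    apply spatial attention to one half of each group and channel
    attention to the other, then mix groups with a channel shuffle."""
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % (2 * groups) == 0
        self.groups = groups
        half = channels // (2 * groups)
        # Spatial attention: 7x7 conv over [avg; max] pooled maps (Eq. 5).
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Channel attention: shared MLP over pooled vectors (Eq. 6).
        self.mlp = nn.Sequential(
            nn.Conv2d(half, max(half // 2, 1), 1), nn.ReLU(),
            nn.Conv2d(max(half // 2, 1), half, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        outs = []
        for g in x.chunk(self.groups, dim=1):
            x1, x2 = g.chunk(2, dim=1)
            # Spatial branch: pool along channels, 7x7 conv, sigmoid gate.
            s = torch.cat([x1.mean(1, keepdim=True),
                           x1.amax(1, keepdim=True)], dim=1)
            x1 = x1 * torch.sigmoid(self.spatial(s))
            # Channel branch: global avg/max pooling through the shared MLP.
            a = self.mlp(x2.mean((2, 3), keepdim=True))
            m = self.mlp(x2.amax((2, 3), keepdim=True))
            x2 = x2 * torch.sigmoid(a + m)
            outs.append(torch.cat([x1, x2], dim=1))
        y = torch.cat(outs, dim=1)
        # Channel shuffle (as in ShuffleNet) to exchange group information.
        return (y.view(b, self.groups, c // self.groups, h, w)
                 .transpose(1, 2).reshape(b, c, h, w))
```

The shuffle at the end mixes channels across groups so that information computed within one group can reach the others, and the output shape matches the input, so the module can be appended to any basic block.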

MDC-DUC
Dilated convolution is commonly used in place of regular convolution to extend the receptive field of the convolutional kernel, obtaining more effective information and improving segmentation accuracy at a small computational cost. However, because of the dilation rate, the convolutional kernel is dispersed, which causes discontinuity in information acquisition and a serious gridding effect. Bilinear interpolation (Gribbon and Bailey 2004) is a conventional upsampling method used by many semantic segmentation models, including HRNet. Its advantage is that the calculation is simple and fast. However, since bilinear interpolation predicts the value of a sample point from several nearby pixels, it inevitably loses image details, which reduces segmentation accuracy.

Figure 6. Structural design of the MDC block: the input and output feature maps of the MDC are fused to obtain more accurate results.
In this study, we construct an MDC-DUC module to replace the bilinear interpolation module in the original version of HRNet, taking advantage of both the MDC (Mixed Dilated Convolution) and DUC (Dense Upsampling Convolution) blocks, as shown in Figure 3. In the MDC-DUC module, the DUC block is applied to recover more of the missing detailed information. The proposed MDC block is embedded into the four parallel outputs of the original HRNet to expand the diversity of the receptive field, thus increasing the semantic accuracy at the pixel level, as shown in Figure 6. Specifically, the designed MDC block consists of 3 × 3 dilated convolutions with three different dilation rates (d = 1, 2, 5). The relevant formula for the dilated convolution is as follows.
$$ k' = k + (k-1)(d-1) \quad (7) $$

where $k$ is the size of the input convolutional kernel, $d$ is the dilation coefficient, and $k'$ is the effective kernel size after the dilation operation. Based on Equation (7), with $k = 3$ the receptive fields of our MDC block are 3, 5, and 11, each equivalent to that of a standard convolution of the same size. The size formula for the output of the convolutional layer is shown in Equation (8).

$$ o = \left\lfloor \frac{i + 2p - k'}{s} \right\rfloor + 1 \quad (8) $$
where $i$ is the input feature map size, $o$ is the output feature map size, $p$ is the padding size, $s$ is the stride, and ⌊·⌋ represents a rounding-down operation. The convolutional kernel size is 3 × 3, so when the stride $s$ is equal to 1 and the padding $p$ is equal to the dilation coefficient $d$, the input and output feature map sizes are equal. The purpose of designing the MDC block is to capture more semantic information and to make the convolutions cover the complete input feature map with no missing regions. The dilation rates used in this paper follow the mixed dilated convolution principle; the advantage of this choice is that it minimizes the gridding effect of dilated convolution while maximizing the effective information of targets at different scales. In addition, we analysed the structure of ResNet and of dilated convolutional blocks with skip connections and propose a new strategy for adding the skip connection. Specifically, we concatenate the feature map of the mixed convolutional output with the original input feature map, which further replenishes the features from before the dilated convolutional block. Then, we use a 1 × 1 convolution to fuse the features during the feature map connection and reduce the subsequent computation (see Figure 6). In this study, the DUC block is used to replace the bilinear interpolation used in the original version of HRNet (Sun, Zhao, et al. 2019), as shown in Figure 7. The main idea of the DUC is to use the number of channels to compensate for the loss of feature map size caused by previous operations such as bilinear interpolation. In addition, DUC is a learnable module that can recover image details better than bilinear interpolation. Assume an input image $X \in \mathbb{R}^{C\times H\times W}$, where C, H, and W are the number of channels, height, and width of the image, respectively.
The model outputs a label map of size H × W and assigns a category label to each pixel of the map. Suppose the feature map output from the MDC block is $x \in \mathbb{R}^{c\times h\times w}$, where $h = H/d$, $w = W/d$, and $d$ is the downsampling factor. The specific function of the DUC is to reshape $x$ into $x' \in \mathbb{R}^{(d^2\times c)\times h\times w}$ through a 3 × 3 convolution, and then to rearrange $x'$ into a feature map $x'' \in \mathbb{R}^{c\times (h\times d)\times (w\times d)}$. Therefore, the final generated feature map has $c$ channels, height H, and width W. DUC applies convolution operations directly on the feature map of the mixed dilated convolutional output to obtain a label map with a final output size of H × W. We upsampled the three smaller feature maps with their corresponding DUCs and then fused them layer by layer.
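A compact PyTorch sketch of the two blocks, under our own assumptions about layer ordering (the paper's exact configuration may differ): the MDC stacks 3 × 3 convolutions with dilation rates 1, 2, and 5 (padding = d preserves the spatial size, per Equation (8)) and fuses the result with the input through a skip concatenation and a 1 × 1 convolution; the DUC expands channels to d² · c and rearranges them with a pixel shuffle.

```python
import torch
import torch.nn as nn

class MDC(nn.Module):
    """Sketch of the MDC block: stacked 3x3 convolutions with dilation
    rates 1, 2, 5 (effective kernels 3, 5, 11 by Eq. 7), followed by a
    skip concatenation with the input and a 1x1 fusion convolution."""
    def __init__(self, channels):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 5)])
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([x, self.convs(x)], dim=1))

class DUC(nn.Module):
    """Sketch of the DUC block: a 3x3 conv expands channels to d^2 * c,
    then a pixel shuffle rearranges them into a d-times larger map."""
    def __init__(self, channels, d):
        super().__init__()
        self.conv = nn.Conv2d(channels, d * d * channels, 3, padding=1)
        self.shuffle = nn.PixelShuffle(d)

    def forward(self, x):
        return self.shuffle(self.conv(x))
```

Because every weight in the DUC is learned, detail recovery is trained end to end, unlike fixed bilinear interpolation.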

Experimental setting and evaluation metrics
In this study, we implemented the proposed network model with Python (version 3.8.2) and PyTorch (version 1.4.0) on a workstation containing two NVIDIA GeForce RTX 2080 Ti GPUs. During model training, we used an SGD optimizer with a base learning rate of 0.01, momentum of 0.9, and a weight decay coefficient of 0.0004. The improved weighted cross-entropy function described in Section 3.2.1 was used as the loss function. The batch size was set to 16 for training and 32 for testing. We trained for 484 epochs on the Potsdam dataset and 100 epochs on the Vaihingen dataset, which achieved the best performance of the model on 2 GPUs with syncBN. Augmentation techniques such as random scaling and flipping were also applied during training. We compared the proposed AD-HRNet with the original version of HRNet and seven other representative classical semantic segmentation models (U-Net, FCN, CCNet, DeeplabV3, DANet, OCRNet, and UNetFormer) on these two datasets. In addition, we performed ablation experiments on the individual components to demonstrate the effectiveness of the proposed modules, including the improved weighted cross-entropy function in the preprocessing module, the Shuffle-CBAM module, and the MDC-DUC module. A generalization experiment was also conducted on the SAMA-VTOL dataset (Bayanlou and Khoshboresh-Masouleh 2021) to validate the effectiveness of the proposed AD-HRNet. The evaluation metrics included the overall accuracy, precision, recall, and the overall and per-category mIoU and F1-score, all of which relate to the confusion matrix shown in Table 1.
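The reported optimizer settings map directly onto PyTorch; the model below is a trivial stand-in for AD-HRNet.

```python
import torch
from torch import nn, optim

# Hypothetical stand-in model; the paper trains AD-HRNet here.
model = nn.Conv2d(3, 6, kernel_size=3, padding=1)

# SGD with the hyperparameters reported in the text.
optimizer = optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=0.0004)
```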
Our evaluation index selection follows the general guidelines in the field of semantic segmentation, including the overall accuracy (OA), precision, recall, F1-score, and mean intersection over union (mIoU) (Xu et al. 2020). The formulas of these evaluation indices are

$$ OA = \frac{TP + TN}{TP + TN + FP + FN} $$

$$ Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN} $$

$$ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \qquad IoU = \frac{TP}{TP + FP + FN} $$

where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative pixels in the prediction map, respectively.
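These metrics reduce to simple arithmetic on the confusion-matrix entries; a small helper makes the standard definitions explicit.

```python
def metrics(tp, tn, fp, fn):
    """Standard segmentation metrics from confusion-matrix counts."""
    oa = (tp + tn) / (tp + tn + fp + fn)      # overall accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)                 # intersection over union
    return oa, precision, recall, f1, iou
```

In practice these counts are accumulated per class over all test patches, and mIoU is the mean of the per-class IoU values.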

Dataset
We mainly evaluated the proposed AD-HRNet on the ISPRS 2D Semantic Segmentation Benchmark Dataset, which includes the Potsdam and Vaihingen datasets, both freely available. This dataset consists of six categories: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. Details about these two datasets and their pre-processing are listed in Table 2. Potsdam, one of the most used datasets in the field of remote sensing semantic segmentation, contains 38 aerial images of 6000 × 6000 pixels collected over the city of Potsdam. We divided it into a training set of 24 images and a test set of 14 images following the method proposed by Li et al. (2021). Since these high-resolution images are too large to train on directly with common hardware, we split each image into 512 × 512 patches. To avoid the adverse effects of splitting, we used a 96-pixel overlap. In the end, we obtained a training set of 5400 patches and a test set of 3150 patches.
The Vaihingen dataset has numerous independent items and tiny multistorey buildings despite being a very small dataset. This dataset contains 33 remote sensing images ranging in size from 1996 × 1995 to 3816 × 2550, with the same category classification as the Potsdam dataset. In this paper, 16 images are selected as the training set and 17 images as the test set. These images are also split into 512 × 512 patches, and the overlap of 96 pixels is used to avoid the adverse effects of splitting. After the splitting is completed, a training set containing 479 images and a test set of 555 images are obtained.
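A tiling scheme with stride 512 − 96 = 416 and a final border-aligned patch reproduces the reported Potsdam patch counts exactly (15 × 15 = 225 patches per 6000 × 6000 image; 24 × 225 = 5400 and 14 × 225 = 3150); the helper below is our reconstruction, not the authors' released code.

```python
def patch_positions(length, patch=512, overlap=96):
    """Start offsets of sliding patches along one image axis; a final
    patch is anchored at the image border so no pixels are dropped."""
    stride = patch - overlap
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:
        starts.append(length - patch)   # border-aligned last patch
    return starts
```

The same routine applied per axis to the variable-sized Vaihingen tiles yields their patch grids as well.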
Apart from these two datasets, we also applied the SAMA-VTOL dataset to validate the effectiveness of the proposed model (Bayanlou and Khoshboresh-Masouleh 2021). The SAMA-VTOL dataset provides richer multi-task information and is suitable for learning. In the SAMA-VTOL dataset, 10 kinds of targets are defined for the semantic segmentation task, including water, ground, parcel boundary, waste object, vehicle, farmland, vegetation, building shadow, building, and vegetation. Although the SAMA-VTOL website describes the dataset, no training data were publicly available; only one test image was released. Therefore, we used it to validate the generalization ability of the AD-HRNet model.

Comparisons between AD-HRNet and other typical models
To validate the performance of the proposed AD-HRNet, we compared it with eight other typical semantic segmentation models on the Potsdam and Vaihingen datasets: U-Net (Ronneberger, Fischer, and Brox 2015), FCN (Long, Shelhamer, and Darrell 2015), CCNet, DeeplabV3, DANet, OCRNet, UNetFormer, and HRNet. Based on the statistics shown in Table 3, the AD-HRNet model achieved better performance in both the W18 and W48 versions than the other models. It should be noted that W18 and W48 represent the width of the high-resolution convolution. Compared with the original HRNet_W18, the proposed AD-HRNet_W18 increased the overall mIoU and F1 score by 0.95% and 0.72%, respectively. The performance of AD-HRNet_W48 on high-resolution images was also improved: its overall mIoU and F1 score increased by 0.74% and 0.54%, respectively, compared with HRNet_W48. Although the statistics in Table 3 show that the overall mIoU and F1 scores of UNetFormer are better than those of the other baseline models (e.g. U-Net, FCN, CCNet, DeeplabV3, DANet, and OCRNet), AD-HRNet still outperforms it. According to the visualization results, UNetFormer is good at capturing global information but fails to recognize targets with small weight and the edges of multi-scale objects. Beyond that, the comparisons with the other typical semantic segmentation models also show that AD-HRNet significantly improved the mIoU of each category, except for cars with AD-HRNet_W48, as shown in Table 4. The reason the proposed model does not outperform HRNet_W48 in recognizing cars is that the class weight of cars in Potsdam is very small; in addition, the edges of many cars are connected with other targets such as Tree and Cluster, which limits the final improvement.
We visualized the semantic segmentation results of all compared models in Figure 8. As we can see from Figure 8, the proposed model can accurately segment different objects and is superior to the other models in the integrity and edges of the segmentation. It can also identify small objects well. For example, compared with the other models, the proposed AD-HRNet can identify buildings without small vacant areas, and the connectivity of the recognized targets is well preserved. In addition, Figure 8 shows that the proposed AD-HRNet can roughly identify the overall structure of the low vegetation class, while the other models perform less well in terms of segmentation integrity and edge detail.

Table 5 shows the comparisons of the proposed model with the other semantic segmentation models on the Vaihingen dataset. The experimental results indicate that the proposed AD-HRNet_W18 improves the overall mIoU by 1.51% and the F1 score by 1.43% relative to HRNet_W18. Meanwhile, AD-HRNet_W48 improves the overall mIoU and F1 score by 0.73% and 0.61%, respectively, compared with HRNet_W48. Table 6 illustrates the per-category comparisons of AD-HRNet with the other models. According to the statistics in Table 6, the mIoU and F1 score of AD-HRNet_W18 for the Imp. Surf, Build, Car, and Cluster classes are higher than those of HRNet_W18. For AD-HRNet_W48, the mIoU for the Low veg., Tree, Car, and Cluster classes is higher than that of HRNet_W48. Table 6 also shows that, for both the W18 and W48 versions, the mIoU of some classes is not better than that of HRNet. Analysing the reasons, we found that our model mainly improves the mIoU of small targets while also producing more accurate edge segmentation.
However, this advantage also causes targets with high weight and blurred edges to be segmented into several pieces, thereby reducing the segmentation accuracy. Figure 9 shows the visualization of semantic segmentation on the Vaihingen dataset. Compared with the other models, the proposed AD-HRNet identifies targets without leaving many empty areas, e.g. in building segmentation. For the Tree and Low vegetation classes, the proposed AD-HRNet captures more edge details than the other models.
Based on the quantitative analysis of the two typical datasets, we find that AD-HRNet_W18 shows a larger improvement in the semantic segmentation of high-resolution images than AD-HRNet_W48. Although the proposed AD-HRNet achieved better performance than HRNet, improvements are still needed in future work, for example, refining the parallel structure to further improve the W48 version. In addition, the proposed model improved the segmentation results on both the Potsdam and Vaihingen datasets, and the improvement on the Vaihingen dataset is slightly larger than that on the Potsdam dataset. This may be because the Potsdam dataset contains more training data than Vaihingen, so HRNet can already capture enough global information during training to recognize target edges at test time. For the Vaihingen dataset, the proportion of the building category is very large, which increases the complexity of the edge information. In this study, we added modules such as MDC-DUC to the framework of AD-HRNet to address this issue, and the experimental results show that the improvement on the Vaihingen dataset is larger than that on the Potsdam dataset, as shown in Figures 8 and 9. Beyond that, we evaluated the stability of our model by setting the number of epochs to 100 and 484 on the Potsdam dataset, and further compared the performance of AD-HRNet with HRNet based on the evaluation indices described in Section 4.1 (see Table 7). The statistical results show that the mIoU and F1 score of AD-HRNet are 0.54% and 0.38% higher than those of HRNet when the number of epochs is set to 100, and the performance is further enhanced after increasing the number of epochs to 484. This test indicates that the proposed model maintains better performance than HRNet regardless of the number of epochs.
Moreover, we conducted five repeated experiments with the proposed model on the Potsdam dataset; the results reported in the tables above correspond to the run with median performance. The standard deviations of mIoU and F1 score over these repeated tests were 0.32 and 0.25, respectively.

The ablation results on the Potsdam dataset are listed in Table 8. Based on these results, we can see that the Shuffle-CBAM and MDC-DUC modules both contribute to improving the segmentation ability of HRNet. Integrating the Shuffle-CBAM module into the framework of HRNet increased the values of mIoU and F1 by 0.68% and 0.48%, respectively. Additionally, integrating the MDC-DUC module into the framework of HRNet improved the values of mIoU and F1 by 0.81% and 0.57%, respectively. Overall, as shown in Table 8, the proposed AD-HRNet improved mIoU and F1 by 0.95% and 0.72%, respectively, compared with HRNet. In this study, we used the improved weighted cross-entropy loss function to balance the identification ability for targets with different weights. Although its improvement is less obvious than that of the other modules, it is still effective based on the experimental results shown in Table 8.
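For reference, the class weights used by such a weighted cross-entropy loss can be obtained with median frequency balancing, sketched below. This is a generic illustration of the scheme (freq_c is the pixel count of class c divided by the total pixels of the images containing class c, and the weight is the median frequency over the class frequency); the function and variable names are ours, not the paper's implementation:

```python
import numpy as np

def median_frequency_weights(label_maps, num_classes):
    """Median frequency balancing: weight_c = median_freq / freq_c."""
    pix = np.zeros(num_classes)   # pixel count per class
    tot = np.zeros(num_classes)   # total pixels of images containing the class
    for lab in label_maps:
        counts = np.bincount(lab.ravel(), minlength=num_classes)
        present = counts > 0
        pix += counts
        tot[present] += lab.size
    freq = pix / np.maximum(tot, 1)
    # Classes absent from every image keep a tiny guard value in freq.
    return np.median(freq[freq > 0]) / np.maximum(freq, 1e-12)
```

The resulting weights can then be passed to a weighted cross-entropy loss so that rare classes, such as cars, contribute more to the gradient than their pixel share alone would allow.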
To validate the performance of the modules added to AD-HRNet more intuitively, we visualize the experimental results on the Potsdam dataset (see Figure 10). For the semantic segmentation of remote sensing images, it is very difficult to identify the shapes and boundaries of incomplete multi-scale objects and irregular objects. As Figure 10(c) shows, the original HRNet cannot solve this problem well. By using dilated convolution to expand the receptive field of the convolution kernel and obtain more local context, and applying dense upsampling convolution to reduce the information loss during upsampling, HRNet with the MDC-DUC module obtains the edges of segmented objects with better accuracy, as shown in Figure 10(d). For AD-HRNet, we use the attention mechanism to obtain the global information of images; in Figure 10(e), we can see that HRNet with the proposed Shuffle-CBAM can identify objects at different scales, especially small objects.
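The dense upsampling convolution (DUC) step mentioned above can be illustrated by the channel-to-space rearrangement at its core: a convolution first predicts r² · L channels at low resolution, which are then reshuffled into L class maps at r-times resolution. The sketch below shows only this rearrangement in NumPy (our own minimal version, not the paper's exact layer):

```python
import numpy as np

def duc_rearrange(features, r):
    """Rearrange a (r*r*L, H, W) feature map into (L, r*H, r*W) scores,
    so upsampling is carried by learned channels, not interpolation."""
    c, h, w = features.shape
    L = c // (r * r)
    x = features.reshape(L, r, r, h, w)  # split channels into r x r sub-pixels
    x = x.transpose(0, 3, 1, 4, 2)       # interleave: (L, H, r, W, r)
    return x.reshape(L, h * r, w * r)
```

Because the r² sub-pixel positions come from learned convolution filters rather than bilinear interpolation, fine boundary details that interpolation would smooth away can be preserved.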
The experimental results shown in Table 9 indicate that the improved weighted cross-entropy loss function yields better results for identifying small objects, increasing the mIoU and F1 score by 0.30% and 0.45%, respectively, on the Vaihingen dataset. Adding the Shuffle-CBAM module to the framework of HRNet increases the mIoU and F1 score by 0.82% and 0.8%, respectively, and adding the MDC-DUC module increases them by 0.75% and 0.62%, respectively. However, the statistics also indicate that the precision of our model on the Vaihingen dataset is slightly lower than that of the original HRNet. This is because our model mainly addresses the class imbalance problem, which may reduce the accuracy of some large objects and lead to a slight decrease in precision. In existing studies, most researchers pay more attention to improving the mIoU and F1 score in semantic segmentation of remote sensing images. Although the precision of the AD-HRNet model on the Vaihingen dataset is slightly lower, it significantly improves the mIoU and F1 score by 1.51% and 1.56%, respectively. Figure 11 illustrates the visualization results of the ablation experiment on the Vaihingen dataset. As we can see from Figure 11, the proposed MDC-DUC and Shuffle-CBAM modules are effective in identifying small targets and recognizing the edges of multi-scale objects (see Figure 11(d,e)). For example, a small building unrecognized by HRNet is identified after we add the MDC-DUC module to the framework of HRNet (Figure 11(d)). The visualized results shown in Figures 10 and 11 indicate that the proposed AD-HRNet, embedded with the constructed Shuffle-CBAM and MDC-DUC modules, not only improves the ability to recognize objects at various scales (especially small objects) but also enhances edge recognition.
The experiments conducted on the Potsdam and Vaihingen datasets also show that our model requires slightly more GFLOPs and a slightly longer inference time per image than HRNet (see Tables 8 and 9). In this study, we focused on improving the mIoU and F1 score of the model and did not pay much attention to reducing the time cost of semantic segmentation. Future work on this issue is therefore still needed.

Generalization experiment
In this study, our model was trained on the Potsdam and Vaihingen datasets. Since the SAMA-VTOL training set is not publicly available, our model could not be trained on images with the geometric and radiometric characteristics of SAMA-VTOL. Thus, we applied the test image provided by the SAMA-VTOL dataset to validate the generalization ability of AD-HRNet. The experimental result is shown in Figure 12. As we can see from Figure 12, the proposed AD-HRNet can effectively recognize targets of the Impervious surfaces, Building, Low vegetation, Tree, Car, and Clutter classes. The unrecognized or misidentified objects mostly belong to categories that do not exist in the training sets, such as building shadows.

Conclusion
In this study, we designed a novel model called AD-HRNet by combining HRNet with attention mechanisms and dilated convolution for the semantic segmentation of remote sensing data. In the framework of AD-HRNet, two new modules, Shuffle-CBAM and MDC-DUC, were constructed to improve the segmentation accuracy. The Shuffle-CBAM module integrates channel attention and spatial attention to obtain more global context information while only slightly increasing the amount of computation. The MDC-DUC module better captures the information of multi-scale objects and the edge information of a large number of irregular objects by expanding the receptive field of the convolution kernel with dilated convolution and then learning the upsampling through dense upsampling convolution. In addition, we used the median frequency balance method to obtain the weight of each class and added it to the weighted cross-entropy function, which replaces the original cross-entropy function of HRNet; this improvement alleviates the class imbalance problem to some extent. The proposed AD-HRNet was comprehensively evaluated on two widely used datasets (Potsdam and Vaihingen) in the field of remote sensing semantic segmentation. It achieved mIoUs of 75.59% and 71.58%, respectively, outperforming existing benchmark models such as HRNet, Deeplab, and OCRNet. The ablation experiments also demonstrated the effectiveness of the proposed modules in remote sensing semantic segmentation tasks. We further validated the generalization ability of AD-HRNet on the SAMA-VTOL dataset: the experimental result showed that the proposed AD-HRNet can recognize objects in the SAMA-VTOL image without being trained on its data. However, limitations still exist.
For example, the comparison experiments on the Potsdam and Vaihingen datasets show that our model increases the time cost. In future work, we will therefore study how to reduce the inference time as much as possible while further improving the accuracy of the model.