ABCNet: Attentive Bilateral Contextual Network for Efficient Semantic Segmentation of Fine-Resolution Remote Sensing Images

Semantic segmentation of remotely sensed images plays a crucial role in precision agriculture, environmental protection, and economic assessment. In recent years, substantial volumes of fine-resolution remote sensing images have become available for semantic segmentation. However, due to the complicated information introduced by the increased spatial resolution, state-of-the-art deep learning algorithms normally utilize complex network architectures for segmentation, which usually incurs high computational complexity. Specifically, the high performance of a convolutional neural network (CNN) relies heavily on fine-grained spatial details (fine resolution) and sufficient contextual information (large receptive fields), both of which trigger high computational costs. This crucially impedes their practicability and availability in real-world scenarios that require real-time processing. In this paper, we propose an Attentive Bilateral Contextual Network (ABCNet), a CNN with two branches, with prominently lower computational cost than cutting-edge algorithms while maintaining competitive accuracy. Code is available at https://github.com/lironui/ABCNet.


INTRODUCTION
Profiting from the rapidly expanding Earth Observation techniques, a large number of remotely sensed images with fine spatial and spectral resolutions are now available for a wide range of application scenarios such as image classification (Lyons et al., 2018; Maggiori et al., 2016), object detection (Li et al., 2017; Xia et al., 2018), and semantic segmentation (Kemker et al., 2018; Zhang et al., 2019a). The revisiting property of orbital acquisitions makes consecutive monitoring of the land surface, ocean, and atmosphere possible (Duan and Li, 2020). Fine-resolution remote sensing images normally contain substantial detailed spatial information on land cover and land use (Duan et al., 2020). Semantic segmentation, which assigns each pixel in an image a definite category, has become one of the most crucial levers for ground object interpretation. Specifically, semantic segmentation from remotely sensed imagery plays a pivotal role in various scenarios including precision agriculture (Griffiths et al., 2019; Picoli et al., 2018), environmental protection (Samie et al., 2020; Yin et al., 2018), and economic assessment (Zhang et al., 2020; Zhang et al., 2019a). From a panoramic view, semantic segmentation is one of the high-level tasks that paves the way for complete scene understanding. Hence, semantic segmentation is at the forefront of a comprehensive effort towards automatic Earth monitoring by international agencies.
To identify image content belonging to various land cover and land use categories, numerous approaches have explored the utilization of spectral and spectral-spatial features to interpret remote sensing images (Gong et al., 1992; Ma et al., 2017; Tucker, 1979; Zhong et al., 2014; Zhu et al., 2017). However, the limited ability of these methods to capture the contextual information contained in images restricts their flexibility and adaptability (Li et al., 2020c; Tong et al., 2020), especially as detailed and structural information surges with the increased spatial resolution.
By contrast, bolstered by its powerful capability to capture nonlinear and hierarchical features automatically, the deep Convolutional Neural Network (CNN) has had a significant impact on the understanding of fine-resolution remote sensing images (Li et al., 2020a; Zheng et al., 2020).
For semantic segmentation, the Fully Convolutional Network (FCN) (Long et al., 2015) is the first proven and effective end-to-end CNN structure. Restricted by the oversimple design of the decoder, the results of the FCN, although very encouraging, appear coarse. Subsequently, the more elaborate encoder-decoder structure (Badrinarayanan et al., 2017; Ronneberger et al., 2015) was proposed, which comprises two symmetric paths: a contracting path for extracting features and an expanding path for exact positioning to accomplish more accurate results. To guarantee segmentation accuracy, global contextual information and multiscale semantic features should be thoroughly utilized for semantic categories of varying sizes in images. Via the spatial pyramid pooling module, the pyramid scene parsing network (PSPNet) (Zhao et al., 2017) aggregates contextual information among different regions. The dual attention network (DANet) (Fu et al., 2019) applies the dot-product attention mechanism to extract abundant contextual relationships. Subject to enormous memory and computational consumption, DANet simply attaches the dot-product attention mechanism at the lowest layer and merely captures long-range dependencies from the smallest feature maps. DeeplabV3 (Chen et al., 2017) adopts atrous convolution to mine multiscale features, while a simple yet valid decoder module is added in DeepLabV3+ (Chen et al., 2018a) to further refine the segmentation results.
The extraction of global contextual information and the exploitation of large-scale feature maps are computationally expensive (Duan and Li, 2020; Li et al., 2020b). Therefore, a series of lightweight networks (Hu et al., 2020; Oršić and Šegvić, 2021; Romera et al., 2017; Yu et al., 2018; Zhuang et al., 2019) have been developed to accelerate computation while keeping an equilibrium between accuracy and efficiency. For example, the asymmetric convolution used in ERFNet (Romera et al., 2017) factorizes the standard 3 × 3 convolution into a 1 × 3 convolution and a 3 × 1 convolution, saving about 33% of the computational cost. By exploiting spatial correlations and cross-channel correlations separately, BiSeNet (Yu et al., 2018) utilizes the depth-wise separable convolution (Chollet, 2017), which further lowers the cost of the standard convolution. Multi-scale encoder-decoder branch pairs with skip connections are studied in ShelfNet (Zhuang et al., 2019), where a shared-weight strategy is harnessed in the residual block to reduce the parameters without sacrificing accuracy. For non-local context aggregation, FANet (Hu et al., 2020) employs the fast attention module in efficient semantic segmentation. SwiftNet (Oršić and Šegvić, 2021) explores the effectiveness of pyramidal fusion in compact architectures.
Due to their limited capacity to extract global context information, there is a large accuracy gap between lightweight networks and state-of-the-art models, which is especially true for fine-resolution remotely sensed images. As a powerful approach that can capture long-range dependencies, the dot-product attention mechanism (Vaswani et al., 2017) is a plausibly ideal solution to remedy this limitation. However, the memory and computational consumption of the dot-product attention mechanism increases quadratically with the spatiotemporal size of the input, which runs counter to the original intention of lightweight networks.
Encouragingly, our previous work on linear attention (Li et al., 2020a), which reduces the complexity of the dot-product attention mechanism from O(N²) to O(N), alleviates this plight. In this paper, we aim to further improve segmentation accuracy while simultaneously ensuring the efficiency of semantic segmentation. We approach this challenging problem by modeling the global contextual information using the linear attention mechanism. To be specific, we propose an Attentive Bilateral Contextual Network (ABCNet) to address efficient semantic segmentation of fine-resolution remote sensing images. Following the design philosophy of BiSeNet (Yu et al., 2018), there are two branches in the proposed ABCNet: a spatial path to retain affluent spatial details and a contextual path to capture global contextual information.
Compared with the encoder-decoder structure (Fig. 1(a)), the bilateral architecture (Fig. 1(b)) can maintain more spatial information without slowing down the model (Yu et al., 2018).
Concretely, the spatial path merely stacks three convolution layers to generate 1/8-resolution feature maps, while the contextual path includes two attention enhancement modules (AEM) to refine the features and capture contextual information. As the features generated by the two paths differ in their level of representation, we further design a feature aggregation module (FAM) to fuse them. Our main contributions are summarized as follows: 1) We propose a novel approach for efficient semantic segmentation of fine-resolution remote sensing images. Specifically, we propose an Attentive Bilateral Contextual Network (ABCNet) with a spatial path and a contextual path.
2) We design two specific modules: an attention enhancement module (AEM) for exploring long-range contextual information and a feature aggregation module (FAM) for fusing the features obtained by the two paths.
3) We achieve competitive results on the ISPRS Vaihingen and ISPRS Potsdam datasets. More specifically, we obtain 91.095% overall accuracy on the Potsdam test set at a speed of 72.13 FPS, even on a mid-range graphics card (GTX 1660Ti).

Related Work

1) Context information extraction
As the performance of semantic segmentation heavily hinges on abundant context information, a great many endeavors have been poured into tackling this issue. The dilated or atrous convolution (Chen et al., 2014; Yu and Koltun, 2015) has been demonstrated to be an effective technology for enlarging receptive fields without shrinking spatial resolution. Also, the encoder-decoder architecture (Ronneberger et al., 2015), which merges high-level and low-level features using skip connections, is another valid way of extracting spatial context. Based on the encoder-decoder framework or a dilation backbone, several subsequent studies have focused on exploring the usage of spatial pyramid pooling (SPP) (He et al., 2015). For example, the pyramid pooling module (PPM) in PSPNet is composed of convolutions with kernels of four different sizes (Zhao et al., 2017), while DeepLab v2 (Chen et al., 2018a) is equipped with the atrous spatial pyramid pooling (ASPP) module, which groups parallel atrous convolution layers with varying dilation rates.
However, SPP still has certain limitations. SPP with standard convolutions faces a dilemma when expanding the receptive field via a large kernel size, as such operations are normally accompanied by a huge number of parameters. SPP with small kernels (e.g., ASPP), on the other hand, lacks sufficient connection between adjacent features and suffers from the gridding problem (Wang et al., 2018a), which occurs when the receptive field is enlarged by a dilated convolutional layer. By contrast, the powerful ability to model long-range dependencies enables the dot-product attention mechanism to extract context information on a global scale.

2) Dot-Product Attention Mechanism
Let H, W, and C denote the height, width, and channels of the input, respectively. The input feature is defined as X = [x_1, ⋯, x_N] ∈ ℝ^{N×C}, where N = H × W. First, the dot-product attention mechanism utilizes three projection matrices W_q ∈ ℝ^{C×D_k}, W_k ∈ ℝ^{C×D_k}, and W_v ∈ ℝ^{C×D_v} to generate the corresponding query matrix Q, the key matrix K, and the value matrix V:

Q = X W_q, K = X W_k, V = X W_v. (1)

Please note that the dimensions of Q and K are supposed to be identical, and all vectors in this section are column vectors by default. Accordingly, a normalization function ρ is employed to measure the similarity between the i-th query feature q_i ∈ ℝ^{D_k} and the j-th key feature k_j ∈ ℝ^{D_k} as ρ(q_i^T k_j) ∈ ℝ. As the query and key features are generated via different layers, the similarities ρ(q_i^T k_j) and ρ(q_j^T k_i) are not symmetric. By calculating the similarities between all pairs of pixels in the input feature maps and taking the similarities as weights, the dot-product attention mechanism generates the value at position i by aggregating the value features from all positions via weighted summation:

D(Q, K, V) = ρ(Q K^T) V. (2)

Normally, the softmax is the frequently-used normalization function:

ρ(Q K^T) = softmax_row(Q K^T), (3)

where softmax_row indicates that the softmax is applied along each row of the matrix Q K^T.
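To make the computation above concrete, the following NumPy sketch implements Eqs. (1)–(3) for a flattened feature map; the matrix sizes and random projection matrices are illustrative assumptions, not the network's actual layers:

```python
import numpy as np

def dot_product_attention(X, Wq, Wk, Wv):
    """Standard dot-product attention (Eqs. 1-3): O(N^2) in the pixel count N."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # (N, Dk), (N, Dk), (N, Dv)
    S = Q @ K.T                                   # (N, N) pairwise similarities
    A = np.exp(S - S.max(axis=1, keepdims=True))  # numerically stable softmax
    A /= A.sum(axis=1, keepdims=True)             # row-wise normalization (Eq. 3)
    return A @ V                                  # weighted sum of values (Eq. 2)

rng = np.random.default_rng(0)
N, C, Dk, Dv = 16, 8, 4, 4                        # toy sizes (N = H x W pixels)
X = rng.standard_normal((N, C))
out = dot_product_attention(X,
                            rng.standard_normal((C, Dk)),
                            rng.standard_normal((C, Dk)),
                            rng.standard_normal((C, Dv)))
print(out.shape)  # (16, 4)
```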
By modeling the similarities between each pair of positions of the input, the global dependencies in the features can be thoroughly extracted by D(Q, K, V). The dot-product attention mechanism was first designed for machine translation (Vaswani et al., 2017), while the non-local module (Wang et al., 2018b) introduced and modified it for computer vision (Fig. 2).
Based on the dot-product attention mechanism and its variants, a constellation of attention-based networks has been proposed to tackle the semantic segmentation task. Inspired by the non-local module (Wang et al., 2018b), the Double Attention Networks (A²-Net) (Chen et al., 2018b), Dual Attention Network (DANet) (Fu et al., 2019), Point-wise Spatial Attention Network (PSANet) (Zhao et al., 2018), Object Context Network (OCNet) (Yuan and Wang, 2018), and Co-occurrent Feature Network (CFNet) (Zhang et al., 2019b) were proposed successively for scene segmentation by exploring long-range dependencies.
Fig. 2 The diagram of the dot-product attention modified for computer vision.
Even though the introduction of attention significantly boosts segmentation performance, the huge resource demand of the dot product critically hinders its application to large inputs. To be specific, for Q ∈ ℝ^{N×D_k} and K^T ∈ ℝ^{D_k×N}, the product Q K^T belongs to ℝ^{N×N}, leading to O(N²) memory and computational complexity. Consequently, it is requisite to lower the high computational-resource demand of the dot-product attention mechanism.
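A quick back-of-the-envelope calculation illustrates how prohibitive the N × N attention map becomes as the feature map grows; the float32 storage assumption is ours:

```python
# Size of the N x N attention map for an H x W feature map, assuming float32.
def attention_map_bytes(h, w):
    n = h * w          # number of pixels N
    return n * n * 4   # N^2 entries, 4 bytes each

for side in (64, 128, 256):
    gib = attention_map_bytes(side, side) / 2**30
    print(f"{side}x{side}: {gib:g} GiB")  # 0.0625, 1, 16 GiB
```

A single 256 × 256 feature map already needs a 16 GiB attention matrix, which is why DANet-style attention is confined to the smallest feature maps.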

3) Generalization and simplification of the dot-product attention mechanism
If the normalization function is the softmax, the i-th row of the result matrix generated by the dot-product attention mechanism can be written as:

D(Q, K, V)_i = Σ_j e^{q_i^T k_j} v_j / Σ_j e^{q_i^T k_j}. (4)

Equation (4) can be rewritten and generalized to any normalization function as:

D(Q, K, V)_i = Σ_j sim(q_i, k_j) v_j / Σ_j sim(q_i, k_j), sim(q_i, k_j) ≥ 0. (5)

If sim(q_i, k_j) = e^{q_i^T k_j}, equation (5) is equivalent to equation (4). When sim(q_i, k_j) is expanded as φ(q_i)^T ψ(k_j), which measures the similarity between q_i and k_j, equation (5) can be rewritten as equation (6) and simplified as equation (7):

D(Q, K, V)_i = Σ_j φ(q_i)^T ψ(k_j) v_j / Σ_j φ(q_i)^T ψ(k_j), (6)

D(Q, K, V)_i = φ(q_i)^T (Σ_j ψ(k_j) v_j^T) / (φ(q_i)^T Σ_j ψ(k_j)). (7)

The vectorized form of equation (7) is:

D(Q, K, V) = (φ(Q) ψ(K)^T V) / (φ(Q) Σ_j ψ(k_j)). (8)

Once the softmax is substituted by sim(q_i, k_j) = φ(q_i)^T ψ(k_j), the order of the multiplications can be altered, thereby avoiding the N × N product between the query matrix Q and the reshaped key matrix K. In concrete terms, the product ψ(K)^T V can be computed first and the result then multiplied by φ(Q), leading to only O(N) time complexity and O(N) space complexity. Suitable choices of φ(·) and ψ(·) enable the above scheme to achieve competitive performance with limited complexity (Katharopoulos et al., 2020; Li et al., 2020b).

4) Linear Attention Mechanism
In our previous work (Li et al., 2020a), we proposed a linear attention mechanism from another perspective, replacing the softmax function with the first-order Taylor approximation of the exponential:

e^{q_i^T k_j} ≈ 1 + q_i^T k_j. (9)

To guarantee that the above approximation is non-negative, q_i and k_j are normalized by the l2 norm, thereby ensuring q_i^T k_j ≥ −1:

sim(q_i, k_j) = 1 + (q_i / ‖q_i‖_2)^T (k_j / ‖k_j‖_2). (10)

Thus, equation (5) can be rewritten as equation (11) and simplified as equation (12):

D(Q, K, V)_i = Σ_j (1 + (q_i/‖q_i‖_2)^T (k_j/‖k_j‖_2)) v_j / Σ_j (1 + (q_i/‖q_i‖_2)^T (k_j/‖k_j‖_2)), (11)

D(Q, K, V)_i = (Σ_j v_j + (q_i/‖q_i‖_2)^T Σ_j (k_j/‖k_j‖_2) v_j^T) / (N + (q_i/‖q_i‖_2)^T Σ_j (k_j/‖k_j‖_2)). (12)

Equation (12) can be turned into the vectorized form:

D(Q, K, V) = (Σ_j v_j^T + Q̃ (K̃^T V)) / (N + Q̃ Σ_j k̃_j), (13)

where Q̃ and K̃ denote the row-wise l2-normalized query and key matrices. Since Σ_j k̃_j v_j^T and Σ_j k̃_j can be calculated once and reused for each query, the time and memory complexity of the attention based on equation (13) is O(N). The validity and efficiency of the proposed attention have been verified through extensive ablation experiments and analysis (Li et al., 2020a).
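The derivation above can be checked numerically. The NumPy sketch below implements the vectorized linear attention of Eq. (13); the K̃^T V product and the key sum are computed once and reused for every query, which is where the O(N) cost comes from. The dimensions are illustrative:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Vectorized linear attention (Eq. 13): O(N) time and memory,
    since Kn^T V and sum(Kn) are computed once and shared by all queries."""
    N = Q.shape[0]
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)  # l2-normalized queries
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)  # l2-normalized keys
    KV = Kn.T @ V                                      # (Dk, Dv), reused per query
    numer = V.sum(axis=0) + Qn @ KV                    # Eq. (12) numerator, all rows
    denom = N + Qn @ Kn.sum(axis=0)                    # Eq. (12) denominator
    return numer / denom[:, None]
```

For sanity, the result coincides with the O(N²) brute force that weights each value by sim(q, k) = 1 + cos(q, k) from Eq. (11).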

5) Efficient semantic segmentation
For many applications, efficiency is critical, which is especially true for real-time (≥30 FPS) scenarios such as autonomous driving. Therefore, recent research has made great efforts to accelerate models for efficient semantic segmentation, either by employing lightweight models or by downsampling the input. The utilization of lightweight convolutions (e.g., the asymmetric convolution and the depth-wise separable convolution) is a common strategy for designing lightweight networks (Romera et al., 2017; Yu et al., 2018). Downsampling the input is a trivial way to speed up semantic segmentation, as it reduces the resolution of the input images, but it thereby sacrifices image details. To extract spatial details at the original resolution, many methods further add a shallow branch, forming the two-path architecture (Yu et al., 2020; Yu et al., 2018).

1) Spatial path
Although both are crucial for high segmentation accuracy, it is difficult to reconcile affluent spatial details with a large receptive field simultaneously.
In particular, for efficient semantic segmentation, mainstream solutions focus on down-sampling the input image or speeding up the network by channel pruning. The former loses the majority of spatial details, while the latter likewise damages them. By contrast, in the proposed ABCNet, we adopt the bilateral architecture (Yu et al., 2018), which is equipped with a spatial path to capture spatial details and generate low-level feature maps. Therefore, a rich channel capacity is essential for this path to encode sufficient detailed spatial information.
Meanwhile, as the spatial path merely focuses on low-level details, a shallow structure with small strides is sufficient for this branch.
Specifically, the spatial path comprises three layers, as shown in Fig. 4(a). Each layer contains a convolution with stride = 2, followed by batch normalization (Ioffe and Szegedy, 2015) and ReLU (Glorot et al., 2011). Therefore, the output feature maps of this path are 1/8 the size of the original image and encode abundant spatial details owing to their large spatial size.
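A hedged PyTorch sketch of the three-layer spatial path follows; the text only specifies three stride-2 conv-BN-ReLU layers yielding 1/8-resolution features, so the channel widths here are our assumption:

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """Convolution + batch normalization + ReLU, the basic unit of the spatial path."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class SpatialPath(nn.Module):
    """Three stride-2 layers -> 1/8-resolution, detail-rich feature maps."""
    def __init__(self, c_in=3, widths=(64, 64, 128)):  # widths are illustrative
        super().__init__()
        layers, c = [], c_in
        for w in widths:
            layers.append(ConvBNReLU(c, w))
            c = w
        self.path = nn.Sequential(*layers)

    def forward(self, x):
        return self.path(x)

x = torch.randn(1, 3, 512, 512)
print(SpatialPath()(x).shape)  # torch.Size([1, 128, 64, 64])
```

With a 512 × 512 input, each stride-2 layer halves the spatial size, so the output is 64 × 64, i.e., 1/8 of the original resolution.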

2) Contextual path
In parallel to the spatial path, the contextual path is designed to extract high-level global context information and provide a sufficient receptive field. To enlarge the receptive field, several networks take advantage of spatial pyramid pooling with large kernels, leading to huge computation and memory consumption. Considering long-range context information and efficient computation simultaneously, we develop the contextual path with the linear attention mechanism (Li et al., 2020a).
Concretely, in the contextual path shown in Fig. 4(a), we harness a lightweight backbone (i.e., ResNet-18) (He et al., 2016) to down-sample the feature map and encode high-level semantic information. Thereafter, we deploy two attention enhancement modules (AEM) at the tail of the backbone to fully extract global context information. The features obtained by the last two stages are fused and fed into the feature aggregation module (FAM).

3) Feature aggregation module
The feature representations of the spatial path and the contextual path are complementary but lie in different domains (i.e., the spatial path generates low-level, detailed features, while the contextual path obtains high-level, semantic features). Thus, simple fusion schemes such as summation and concatenation are not appropriate for fusing this information. Instead, we design a feature aggregation module (FAM) to merge both types of feature representation with consideration of both accuracy and efficiency.
As shown in Fig. 4(c), given the two domains of features, we first concatenate the outputs of the spatial path and the context path. Thereafter, a convolution layer with batch normalization (Ioffe and Szegedy, 2015) and ReLU (Glorot et al., 2011) is attached to balance the scales of the features. Then, we capture the long-range dependencies of the generated features using the linear attention mechanism. The detailed design of the FAM can be seen in Fig. 4(c).
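A minimal PyTorch sketch of a FAM-like module, assuming concatenation, a 1 × 1 conv-BN-ReLU for scale balancing, and the linear attention of Eq. (13) for long-range refinement; the residual connection, projection widths, and channel sizes are our assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregation(nn.Module):
    """FAM-like sketch: concat -> 1x1 conv-BN-ReLU -> linear attention."""
    def __init__(self, c_sp, c_cp, c_out):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_sp + c_cp, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True))
        # 1x1 projections producing query, key, and value maps
        self.q = nn.Conv2d(c_out, c_out // 2, 1)
        self.k = nn.Conv2d(c_out, c_out // 2, 1)
        self.v = nn.Conv2d(c_out, c_out, 1)

    def forward(self, sp, cp):
        x = self.fuse(torch.cat([sp, cp], dim=1))     # balance feature scales
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)      # (B, N, Dk)
        k = self.k(x).flatten(2).transpose(1, 2)      # (B, N, Dk)
        v = self.v(x).flatten(2).transpose(1, 2)      # (B, N, C)
        qn, kn = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        n = q.shape[1]
        kv = kn.transpose(1, 2) @ v                   # (B, Dk, C), computed once
        numer = v.sum(dim=1, keepdim=True) + qn @ kv  # Eq. (13) numerator
        denom = n + qn @ kn.sum(dim=1).unsqueeze(-1)  # Eq. (13) denominator
        out = (numer / denom).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                # residual refinement (assumption)
```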

4) Loss function
As can be seen from Fig. 1(b), besides the principal loss function supervising the output of the whole network, we utilize two auxiliary loss functions on the context path to accelerate convergence. We select the cross-entropy loss as the principal loss:

ℓ_ce(p, y) = −Σ_k y_k log(p_k), (14)

where p is the prediction generated by the network, while y is the ground truth. The auxiliary loss functions are chosen as the focal loss:

ℓ_f(p, y) = −Σ_k (1 − p_k)^γ y_k log(p_k), (15)

where γ is the focusing parameter, which controls the down-weighting of easily classified examples and is set to 2 in our experiments. Hence, the overall loss of the network is:

ℓ(p, y) = ℓ_ce(p, y) + ℓ_f1(p, y) + ℓ_f2(p, y). (16)
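The loss combination above can be sketched in PyTorch as follows; tensor shapes are illustrative, the focal loss is the standard form with focusing parameter γ, and the unweighted sum of the three terms follows Eq. (16):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """Focal loss (Eq. 15): down-weights easy pixels by (1 - p)^gamma."""
    logp = F.log_softmax(logits, dim=1)  # (B, K, H, W) log-probabilities
    p = logp.exp()
    return F.nll_loss((1 - p) ** gamma * logp, target)

def total_loss(main_out, aux1, aux2, target, gamma=2.0):
    """Eq. (16): principal cross-entropy plus two auxiliary focal losses
    attached to the contextual path."""
    return (F.cross_entropy(main_out, target)
            + focal_loss(aux1, target, gamma)
            + focal_loss(aux2, target, gamma))
```

Note that with γ = 0 the focal loss reduces to plain cross-entropy, which is a convenient sanity check.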

1) Datasets
The effectiveness of the proposed ABCNet is verified using the ISPRS Potsdam dataset and the ISPRS Vaihingen dataset.
Potsdam: There are 38 fine-resolution images of size 6000 × 6000 pixels with a ground sampling distance (GSD) of 5 cm in the Potsdam dataset. The dataset provides near-infrared, red, green, and blue channels as well as the DSM and normalized DSM (NDSM). We utilize IDs 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, and 7_13 for testing, ID 2_10 for validation, and the remaining 22 images, except the image named 7_10 with erroneous annotations, for training. Please note that we only employ the red, green, and blue channels in our experiments.

2) Evaluation Metrics
The performance of ABCNet is evaluated using the overall accuracy (OA), the mean Intersection over Union (mIoU), and the F1 score (F1). Based on the accumulated confusion matrix, the OA, mIoU, and per-class F1 are computed as:

OA = Σ_k TP_k / (Σ_k TP_k + FP_k + TN_k + FN_k),
mIoU = (1/K) Σ_k TP_k / (TP_k + FP_k + FN_k),
F1_k = 2 TP_k / (2 TP_k + FP_k + FN_k),

where TP_k, FP_k, TN_k, and FN_k represent the true positives, false positives, true negatives, and false negatives, respectively, for the object indexed as class k. OA is computed over all categories including the background.
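The metrics above can be computed from an accumulated confusion matrix, as in the NumPy sketch below; it assumes every class occurs at least once (otherwise the per-class divisions would need guarding):

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix (rows = ground truth)."""
    idx = gt.astype(int) * num_classes + pred.astype(int)
    return np.bincount(idx.ravel(), minlength=num_classes ** 2).reshape(num_classes, num_classes)

def scores(cm):
    """OA, mIoU, and mean F1 derived from the confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as k but not k
    fn = cm.sum(axis=1) - tp          # is k but predicted otherwise
    oa = tp.sum() / cm.sum()
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return oa, iou.mean(), f1.mean()
```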

3) Experimental Setting
All of the training procedures are implemented with PyTorch on a single Tesla V100 with a batch size of 32, and the optimizer is set as AdamW with a learning rate of 0.0003. For training, the raw images are cropped into 512 × 512 patches and augmented by rotating, resizing, horizontal flipping, vertical flipping, and adding random noise. The comparative methods include contextual information aggregation methods designed initially for natural images, such as the pyramid scene parsing network (PSPNet) (Zhao et al., 2017) and dual attention network (DANet) (Fu et al., 2019); multi-scale feature aggregation models proposed for remote sensing images, like the multi-stage attention ResU-Net (MAResU-Net) (Li et al., 2020a) and edge-aware neural network (EaNet) (Zheng et al., 2020); and lightweight networks developed for efficient semantic segmentation, including the depth-wise asymmetric bottleneck network (DABNet) (Li et al., 2019), efficient residual factorized convNet (ERFNet) (Romera et al., 2017), bilateral segmentation network V1 (BiSeNetV1) (Yu et al., 2018) and V2 (BiSeNetV2) (Yu et al., 2020), fast attention network (FANet) (Hu et al., 2020), ShelfNet (Zhuang et al., 2019), and SwiftNet (Oršić and Šegvić, 2021). Test-time augmentation (TTA) in terms of rotating and flipping is applied for all comparative methods.

4) Ablation study
To verify the effectiveness of the components in the proposed ABCNet, we conduct extensive ablation experiments. The setting details and quantitative results are listed in Table I.
Baseline: We utilize ResNet-18 as the backbone of the contextual path and select the contextual path without the AEM (denoted as CP in Table I) as the baseline. The feature maps generated by CP are directly up-sampled to the same shape as the original input image.

Ablation for attention enhancement module:
To capture global context information, we specially design an attention enhancement module (AEM) in the contextual path. As presented in Table I, for both datasets, the utilization of the AEM (indicated as CP + AEM) brings more than a 1.5% improvement in mIoU.
Ablation for the spatial path: As affluent spatial information is crucial for semantic segmentation, the spatial path is designed to preserve the spatial size and extract spatial information.
Ablation for the feature aggregation module: Given that the features obtained by the spatial path and the contextual path lie in different domains, neither summation nor concatenation is the optimal fusion scheme. As can be seen from Table I, the significant performance gap confirms the validity of the feature aggregation module (signified as CP + SP + AEM + FAM).

5) The complexity and speed of the network
Complexity and speed are momentous factors for measuring the merit of an algorithm, which is especially true for practical applications. For a thorough comparison, we implement our experiments under different settings. First, the parameters and computational complexity of the different networks are reported in Table Ⅱ, where 'G' indicates Giga (i.e., 10⁹ floating-point operations) and 'M' signifies Million (i.e., the unit of the parameter count).
Meanwhile, for a fair comparison, we choose 256×256, 512×512, 1024×1024, 2048×2048, and 4096×4096 as the resolutions of the input image and report the inference speed, measured in frames per second (FPS), on a mid-range notebook graphics card (GTX 1660Ti).
The proposed ABCNet balances speed and accuracy. As can be seen from the last column of Table Ⅱ, the mIoU on the Potsdam dataset achieved by ABCNet is at least 1.79% higher than that of the comparative methods. Meanwhile, ABCNet maintains a speed of 72.13 FPS for a 512×512 input. Besides, the elaborate design enables ABCNet to handle massive inputs (4096×4096), while more than half of the comparative methods run out of memory for such a large input.

6) Results on the ISPRS Vaihingen dataset
The ISPRS Vaihingen is a relatively small dataset. Besides, there is a small covariate shift between the training and test sets (Ghassemi et al., 2019). Therefore, high performance can easily be achieved by specifically designed networks, especially those fusing true orthophoto (TOP) images with the auxiliary DSM or NDSM. In this part, we show that our ABCNet, using only TOP images and an efficient architecture, not only transcends lightweight networks but also achieves competitive performance against those specially designed models.
Notably, our ABCNet obtains an 85.299% F1 score (Table Ⅲ), which is at least 4% higher than the other methods. To further evaluate the statistical significance, we report the Kappa z-test for pairwise methods based on the Kappa coefficients of agreement and their variances:

z = |k_1 − k_2| / √(v_1 + v_2),

where k signifies the Kappa coefficient and v denotes the Kappa variance. Concretely, if the value of z is greater than 1.96, the two algorithms are significantly different at the 95% confidence level. As can be seen from Table Ⅳ, the accuracy of the proposed ABCNet is statistically higher than that of the other comparative methods. In addition, we visualize area 38 in Fig. 5 to qualitatively demonstrate the effectiveness of our ABCNet, while the enlarged results are shown in Fig. 7(a).
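The Kappa z-test above is straightforward to compute; the coefficient and variance values in the example are made-up illustrations, not values from the paper:

```python
import math

def kappa_z(k1, v1, k2, v2):
    """Pairwise Kappa z-test: z > 1.96 -> significantly different at the 95% level."""
    return abs(k1 - k2) / math.sqrt(v1 + v2)

# Hypothetical Kappa coefficients and variances for two methods:
z = kappa_z(0.90, 1e-4, 0.88, 1e-4)
print(round(z, 3), "-> significant" if z > 1.96 else "-> not significant")
```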
For a comprehensive evaluation, ABCNet is also compared with other state-of-the-art methods.
As can be seen in Table Ⅴ, as a lightweight network, the proposed ABCNet achieves competitive performance even compared with specially designed models with complex structures. It is worth noting that the speed of our ABCNet is two to seven times faster than that of those methods.

Fig. 1
Fig. 1 Illustration of (a) the encoder-decoder structure and (b) the bilateral architecture.

Fig. 3
Fig. 3 The (a) computation requirement and (b) memory requirement of the linear attention mechanism compared with the dot-product attention mechanism.

Fig. 4
Fig. 4 An overview of the Attentive Bilateral Contextual Network. (a) Network Architecture.

Fig. 6
Fig. 6 Mapping results for test images of Potsdam tile 3_13.

Fig. 7
Fig. 7 Enlarged visualization of results on (LEFT) the Vaihingen dataset and (RIGHT) the Potsdam dataset.

TABLE I
ABLATION STUDY OF EACH COMPONENT IN OUR PROPOSED ABCNET.

TABLE Ⅱ
THE COMPLEXITY AND SPEED OF THE PROPOSED ABCNET AND COMPARATIVE METHODS.
* means the network is out of memory.

As shown in Table Ⅲ, the numeric scores for the ISPRS Vaihingen test set demonstrate that our ABCNet delivers robust performance and exceeds the other lightweight networks in mean F1, OA, and mIoU by a considerable margin. Significantly, the ''car'' class in the Vaihingen dataset is difficult to handle as it comprises relatively small objects. Nonetheless, our ABCNet handles this class markedly better than the other lightweight networks.

TABLE Ⅲ
QUANTITATIVE COMPARISON RESULTS ON THE VAIHINGEN TEST SET.

TABLE Ⅳ
KAPPA Z-TEST COMPARING THE PERFORMANCE OF DIFFERENT METHODS ON THE VAIHINGEN DATASET.

TABLE Ⅴ
QUANTITATIVE COMPARISON RESULTS ON THE VAIHINGEN TEST SET WITH STATE-OF-THE-ART METHODS.
# means the DSM or NDSM are used in the network.
We carry out experiments on the ISPRS Potsdam dataset to further evaluate the performance of ABCNet. Numerical comparisons with other lightweight methods are shown in Table Ⅵ, while the Kappa z-test is illustrated in Table Ⅶ. Remarkably, ABCNet achieves 91.095% overall accuracy and 88.561% mIoU, and the Kappa z-test strongly confirms its superiority over the other lightweight networks. The visualization of area 3_13 is displayed in Fig. 6, and the enlarged results are exhibited in Fig. 7(b). As there are sufficient images in the Potsdam dataset to train the network, the performance of ABCNet is on par with state-of-the-art methods at a much faster speed. The comparisons are illustrated in Table Ⅷ.

TABLE Ⅵ
QUANTITATIVE COMPARISON RESULTS ON THE POTSDAM TEST SET.

CONCLUSION
In this paper, we proposed an Attentive Bilateral Contextual Network (ABCNet) comprising a spatial path and a contextual path. In particular, we design an attention enhancement module to model long-range dependencies from the extracted feature maps. Additionally, to address the feature fusion issue and improve effectiveness, a feature aggregation module is presented to adequately merge the detailed features captured by the spatial path and the semantic features generated by the contextual path. Extensive experiments on the ISPRS Vaihingen and Potsdam datasets demonstrate the effectiveness and efficiency of the proposed ABCNet.

TABLE Ⅶ
KAPPA Z-TEST COMPARING THE PERFORMANCE OF DIFFERENT METHODS ON THE POTSDAM DATASET.

TABLE Ⅷ
QUANTITATIVE COMPARISON RESULTS ON THE POTSDAM TEST SET WITH STATE-OF-THE-ART METHODS.
# means the DSM or NDSM are used in the network.