Feature Guide Network With Context Aggregation Pyramid for Remote Sensing Image Segmentation

In recent years, deep learning methods based on fully convolutional networks have proven effective for the semantic segmentation of remote sensing images (RSIs). However, the rich information and complex content of RSIs make training segmentation networks challenging. In particular, the observation distance between spaceborne cameras and ground objects is extremely large, so some small objects occupy only a few pixels in the image. Because the features of tiny objects degrade rapidly during training, most algorithms cannot handle these common small objects in RSIs with satisfactory results. In this article, we propose a novel feature guide network with a context aggregation pyramid (CAP) for RSI segmentation to address these issues. An innovative edge-guided feature transform module is designed to exploit the edge and body information of objects to strengthen edge contours and the internal consistency of homogeneous regions, which explicitly enhances the representation of tiny objects and relieves their degradation. Furthermore, we design a CAP pooling strategy that adaptively captures an optimal feature characterization by assembling multiscale features according to the significance of different contexts. Extensive experiments on three large-scale remote sensing datasets demonstrate that our method not only outperforms state-of-the-art methods for objects of different scales but also achieves robust segmentation results, especially for tiny objects.


I. INTRODUCTION
The semantic segmentation of remote sensing imagery, which assigns a unique category label to each pixel in space-captured images of Earth, is fundamental to infrastructure planning, territorial change detection, and environmental monitoring [1], [2], [3]. Unlike natural images, subdecimeter aerial images present many challenges due to their complex contents and illumination conditions, making it difficult to obtain fine-grained semantic segmentation results [4], [5].
A major challenge is that the vast imaging range of aerial satellites makes remote sensing scenes highly complex and diverse: the ground objects in a scene vary greatly in geometric shape, size, and texture, which greatly complicates feature extraction from remote sensing images (RSIs) [6], [7]. In addition, the downsampling operations in deep convolutional neural networks gradually erase details of the original image, losing pivotal information and producing unsatisfactory segmentation predictions, especially for boundary details and small objects [8]. A commonly adopted remedy is to fuse low-level features, which contain more detailed information, with high-level features, which carry more semantic information, through a decoder structure. For instance, fully convolutional networks (FCNs) connect feature maps from shallow layers to the output features to gain more spatial location information. UNet combines features from the corresponding low layers at every upsampling step of the resolution recovery stage [9]. DeepLabv3+ [10], an outstanding framework, embeds features from the first residual convolution group into the refined features. However, these methods handle all information, such as colors, textures, and edges, in the same way, ignoring the disparate contributions of specific information to the semantic segmentation task.
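To make the fusion concrete, the following PyTorch sketch shows a generic UNet-style decoder block that upsamples high-level semantic features and concatenates them with low-level detail features. The block name `SkipFusionBlock` and the channel sizes are illustrative assumptions, not any particular published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusionBlock(nn.Module):
    """Generic decoder block (hypothetical): upsample high-level
    features and fuse them with low-level features by concatenation,
    as in UNet-style decoders."""

    def __init__(self, high_ch: int, low_ch: int, out_ch: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(high_ch + low_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # Bilinearly upsample the semantic (high-level) map to the
        # spatial size of the detailed (low-level) map, then fuse.
        high = F.interpolate(high, size=low.shape[-2:],
                             mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([high, low], dim=1))
```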
Another challenge is the severe category imbalance among classes in RSIs, which is relatively rare in natural images but quite common in space-captured images of Earth. Furthermore, the interclass distinction is relatively small. Specifically, roads and cars are ordinarily very tiny and occupy only a dozen pixels in the image. Even worse, these tiny objects are surrounded by dominant ground objects that take up the vast majority of all pixels. Fig. 1 demonstrates a typical example of these challenges: on the left is a remote sensing image, and on the right is the corresponding semantic label, where yellow represents vegetation, green represents roads, red represents settlements, and black represents background. Compared with the dominant vegetation class, the roads and settlements account for an extremely small proportion of the scene. In addition, roads and settlements are easily confused with the vegetation that surrounds them. The most straightforward strategies for handling class imbalance in semantic segmentation are selective data augmentation and decision-level fusion of the results of multiple networks [11], [12], [13], [14], [15], [16]. These strategies, however, are either prone to overfitting or too heavy for practical application. On the other hand, many current methods combine multiscale representations through the skip connections of typical encoder-decoder architectures [9], [10], [17], [18], [19]. However, this type of feature extraction does not explicitly enhance the representation of small, especially tiny, objects, leaving them prone to degradation by large objects with a dominant number of pixels, as shown in Fig. 2, where the tiny targets are marked with red boxes and the predictions of PSPNet [20] almost lose them.
To conquer the abovementioned issues, this work proposes to explicitly enhance the characterization of small or tiny objects during network training. Specifically, efficient convolution layers are introduced to extract edge and body structure information in the decoder part of the network. Fig. 3 visualizes the body and edge features produced by the designed feature extraction module, illustrating that less distinct objects in the original image, such as roads and cars, become more distinguishable in the body and edge features, which benefits segmentation, especially for tiny-structured objects. A practical edge-guided feature transform module (EGFTM) is designed to exploit these features to strengthen edge contours and maintain the internal consistency of homogeneous areas. It operates in a coarse-to-fine manner to progressively enhance the spatial structure of small objects. Furthermore, a context aggregation pyramid (CAP) pooling strategy is devised to adaptively achieve an optimal fused characterization of multiscale features according to different contexts. The proposed strategies are general and can be applied to any off-the-shelf segmentation network architecture. The proposed network is tested on three large-scale remote sensing datasets and consistently improves accuracy regardless of the network backbone. Moreover, the proposed method outperforms most state-of-the-art (SOTA) methods, especially in segmenting small objects.
The main contributions of the current work can be summarized as follows.
1) A novel RSI semantic segmentation framework, the feature guide network with context aggregation pyramid (FGN-CAP), is proposed based on a ResNet backbone. It effectively alleviates the degradation of small objects and outperforms other baselines with SOTA results on three RSI benchmarks.

2) An EGFTM is designed that utilizes the edge and body features extracted in the encoder to enhance heterogeneous boundaries and constrain the internal context consistency of homogeneous regions, resulting in better parsing of small or tiny targets surrounded by dominant objects.

3) A CAP is further proposed to improve segmentation performance by dynamically learning a superior multiscale context characterization for each object of varying size. In particular, a corresponding affine matrix is exploited to modulate the significance of features at different scales.

The rest of this article is organized as follows. Section II concisely introduces the related work. Section III describes our proposed method in detail. Section IV presents sufficient experiments with analysis. Section V concludes this article.

II. RELATED WORK

A. General Semantic Segmentation
Traditional semantic segmentation methods rely heavily on conditional random fields, which remain a common refinement component in some current methods [21], [22], [23], [24]. These methods, however, are sensitive to scale variations and are inefficient. To overcome these limitations, many methods capture contextual information through different strategies [9], [10], [17], [20]. Furthermore, many current methods adopt nonlocal operators and self-attention mechanisms to obtain more detailed context [18], [25], [26], [27], [28]. Other attempts introduce graph convolutional networks to propagate context information between regions [19], [29]. These methods, however, do not explicitly enhance the context of tiny objects, leaving them prone to being affected by larger objects during context propagation.
Recently, aggregating multiscale features through pyramid network architectures has become a dominant strategy for improving semantic segmentation accuracy [30]. SegNet relies on the common encoder-decoder architecture to combine low-level and high-level features [31]. PSPNet and DeepLabv3 propose custom modules to capture multiscale contexts [17]. These methods, however, do not explicitly enhance the representation of tiny objects and still cannot prevent their degradation, especially in scenes dominated by other categories. In contrast, we propose to explicitly enhance the representation of tiny objects with spatial structures, such as edges, to facilitate the aggregation of multiscale features.
Leveraging other tasks through joint learning is also a recent direction for semantic segmentation; GSCNN and DSRL [32], [33], for example, embed subnetworks for shape estimation and super-resolution estimation, respectively, into the main network. The proposed network likewise conducts the standard semantic segmentation task while extracting edge and body features and using a joint loss for supervised learning.

B. Semantic Segmentation in RSIs
Semantic segmentation of RSIs involves classifying houses, roads, vegetation, water, farmland, and other land-cover classes with pixel-level precision. Early research focused on unsupervised learning based on graph theory [34], [35], [36]. Recently, semantic segmentation models based on modern deep learning, such as FCNs [37], have dramatically improved accuracy on most major RSI benchmarks. Volpi and Tuia [38] proposed an encoder-decoder CNN architecture that enhances results through deconvolution. Sun et al. [39] proposed ensemble strategies and a residual architecture to address the structural stereotype and insufficient learning of the encoder-decoder framework. Although these methods increase the overall accuracy (OA) through deep feature representation, the accuracy on some tiny objects is still not satisfactorily improved. In practice, an RSI contains many tiny-structured objects, including point objects, such as towers and chimneys, and linear targets, such as roads and small streams. To further improve the accuracy on tiny objects in RSIs, methods following general semantic segmentation have been proposed for specific application scenarios [11], [40], [41], [42]; they fuse different context information and extract foreground information and long-short spatial dependences to enhance segmentation accuracy. Zhang et al. [43] proposed an end-to-end attention-based semantic segmentation network with a pyramid attention pooling module that introduces the attention mechanism into the multiscale module for adaptive feature refinement. Bai et al. [44] combined multiscale processing with the attention mechanism and proposed a multiscale attention module to enhance the network's fine-grained representation ability and its extraction of global context information.

However, these existing semantic segmentation methods for RSIs handle all context information, such as color, texture, and edges, in the same way, ignoring the disparate contributions of different information. Furthermore, because of overlap between categories, the intraclass variance is large whereas the interclass variance is small, and the discrimination between categories is not apparent. The irregularity and complexity of boundary shapes make segmentation near boundaries difficult; in particular, the edges of tiny targets surrounded by large-scale land covers are hard to distinguish. To overcome this, the present study proposes an efficient module that enhances the network's sensitivity to edge structures, which is critical for segmenting tiny objects. The proposed approach treats the body consistency and edge preservation of objects as vital auxiliary information for the segmentation task: a simple module extracts the body and edge features through supervised learning, and with their assistance, the decoder reconstructs the feature resolution from coarse to fine, layer by layer.

C. Edge Detection
The ideal result of semantic segmentation is the accurate delineation of the edges of different objects in the image, so edge detection and semantic segmentation are consistent to a certain degree. Based on this similarity and synergy, some studies use edge detection tasks to improve segmentation results near image edges and the overall segmentation accuracy. Bertasius et al. [45] used a "high-for-low" method to predict boundaries using object-level features from a pretrained object classification network, in which high-level object features inform the low-level boundary detection process. Kokkinos [46] applied deep convolutional neural network training to the boundary detection task; a carefully designed boundary detection loss and a multiresolution architecture improved boundary detection. Bertasius et al. [47] used intermediate features for semantic segmentation, predicted edges through domain transformation, and optimized the quality of the target semantic segmentation. Marmanis et al. [48] proposed an end-to-end trainable deep convolutional network for semantic segmentation with built-in awareness of semantically meaningful boundaries: first, a relatively simple and memory-efficient model was constructed by adding boundary detection to the SegNet encoder-decoder architecture; second, boundary detection was added to an FCN-type model and a high-end classifier ensemble was established. Combining semantic segmentation with semantically informed edge detection clarifies the class boundaries in the model. Yang et al. [49] proposed a new end-to-end edge-aware network, EANet, to extract accurate buildings from aerial images. The architecture comprises an image segmentation network and an edge perception network, responsible for building prediction and edge detection, respectively; it pays more attention to low-level details, such as edges, rather than emphasizing multiscale feature fusion or enlarging receptive fields to obtain global features. Yu et al. [50] proposed a discriminative feature network consisting of two subnetworks: a smooth network and a boundary network. To deal with intraclass inconsistency, the smooth network selects more discriminative features through a channel attention block and global average pooling, whereas the boundary network distinguishes the bilateral characteristics of the boundary through deep semantic boundary supervision. Zhao et al. [7] designed an auxiliary edge detection task to provide edge constraints for semantic segmentation.
The present study introduces the edge detection task into semantic segmentation to take advantage of this synergy and proposes an end-to-end semantic segmentation network structure. Considering that different features of the image content contribute differently to the semantic segmentation task, the proposed method explicitly extracts edge and body features in the network and optimizes the segmentation results using edge constraints and the internal consistency of the body.

III. METHODOLOGY
In this section, the overall architecture of the proposed semantic segmentation CNN is first described. The proposed CAP and EGFTM modules are then presented in detail. Finally, the loss function exploited in network training is introduced.

A. Network Architecture
Fig. 4 depicts our proposed network architecture, which is based on an encoder-decoder structure. The backbone of the encoder is ResNet [51], which extracts and refines features gradually through multiple groups of residual blocks; in the encoder stage, edge and body information are extracted from the feature maps and constrained by corresponding loss functions. The backbone is followed by a CAP module, which focuses on capturing objects at multiple scales in the image and constructing adaptive contextual representations. In the decoder stage, considering the disparity between edge and body features, an EGFTM uses them to guide the upsampling that restores the output resolution.
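The following PyTorch sketch illustrates how such a pipeline could be assembled: a ResNet encoder exposes low- and high-level features, simple heads extract the edge and body features, the CAP refines the context, and the EGFTM guides the resolution recovery. The feature taps, channel sizes, and module interfaces here are our assumptions for illustration; the CAP and EGFTM internals are sketched in Sections III-B and III-C.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet101
from torchvision.models.feature_extraction import create_feature_extractor

class FGNCAPSketch(nn.Module):
    """High-level sketch (assumed interfaces, not the authors' code):
    ResNet encoder -> edge/body heads -> CAP context -> EGFTM-guided
    decoding -> upsampled logits."""

    def __init__(self, cap: nn.Module, egftm: nn.Module,
                 num_classes: int, feat_ch: int = 256):
        super().__init__()
        # Tap layer1 for low-level detail and layer4 for semantics.
        self.encoder = create_feature_extractor(
            resnet101(weights=None),
            return_nodes={"layer1": "low", "layer4": "high"})
        self.edge_head = nn.Conv2d(256, feat_ch, 1)    # F_e from low layers
        self.body_head = nn.Conv2d(2048, feat_ch, 1)   # F_b from high layers
        self.cap, self.egftm = cap, egftm
        self.classifier = nn.Conv2d(feat_ch, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(x)
        f_edge = self.edge_head(feats["low"])
        f_body = self.body_head(feats["high"])
        context = self.cap(feats["high"])              # refined feature F
        # Align the context with the edge features before guidance.
        context = F.interpolate(context, size=f_edge.shape[-2:],
                                mode="bilinear", align_corners=False)
        f_m = self.egftm(context, f_body, f_edge)
        logits = self.classifier(f_m)
        return F.interpolate(logits, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)
```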

B. Context Aggregation Pyramid Module
It has been widely demonstrated that context information is critical for scene parsing and semantic segmentation [10], [20]. Complex remote sensing scenes contain many objects of uneven scale; for example, a telegraph pole occupies only one or two pixels in the image, whereas a lake can take up the whole RSI. Objects of different scales need to aggregate contextual information over either long or short ranges. We introduce the self-attention mechanism into the pyramid structure and present a CAP that builds a learned model for multiscale objects, ignoring irrelevant information and focusing on significant information [18], [26]. In this way, the most appropriate context representation is constructed according to the dynamically learned significance of each scale in the pyramid.
Like the PPM in PSPNet, the proposed CAP constructs a multiscale pooling pyramid to capture contexts. As exhibited in Fig. 5, a pyramid of pooled features is first created with different pooling scales s. Each scale has an affine matrix, and the optimal feature representation at that scale is calculated from the pooled feature and the corresponding affine matrix. Here, one scale s is taken as an example; the other scales are processed similarly. Given an input image I for segmentation, the feature map F ∈ R^{h×w×c} is computed by the backbone CNN, where h × w denotes the spatial resolution and c the channel dimension. For the scale s, F is first processed with a 1 × 1 convolution to obtain the reduced feature map x_s ∈ R^{h×w×k}. Then, x_s is processed in two parallel branches. In the first branch, the channel dimension of x_s is reduced to s² by a 1 × 1 convolution to acquire the affine matrix, which is then reshaped into hw × s². In the second branch, x_s is transformed into s × s × k by adaptive pooling and reshaped into s² × k. The adaptive weight w_s for every pixel of x_s is then obtained as the matrix product of the outputs of the two branches:

w_s = φ(x_s) ⊗ ρ(x_s)  (1)

where φ and ϕ represent 1 × 1 convolutions, ρ represents adaptive pooling, and ⊗ denotes the matrix product after the reshaping described above, yielding w_s ∈ R^{hw×k}, which is reshaped back to h × w × k. The representation z_s is then calculated by modulating x_s with the adaptive weight:

z_s = w_s ⊙ x_s  (2)

where ⊙ denotes element-wise multiplication. The same steps are performed at the other scales of the pyramid. Finally, the z_{s_i} of all scales and the feature map F are concatenated to obtain the refined representation Z that is adaptive to all objects:

Z = [F, z_{s_1}, z_{s_2}, …, z_{s_n}]  (3)

where [·] denotes channel-wise concatenation.
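A minimal PyTorch sketch of one CAP pyramid level and the final concatenation, following (1)-(3) as reconstructed above. The class names `CAPScaleBranch` and `CAP` are hypothetical, and the softmax normalization of the affinity matrix is an assumption not stated in the text.

```python
import torch
import torch.nn as nn

class CAPScaleBranch(nn.Module):
    """One pyramid level: an affinity matrix (hw x s^2) reweights the
    s x s pooled context into a per-pixel weight w_s that modulates
    x_s, per eqs. (1)-(2) as reconstructed above."""

    def __init__(self, in_ch: int, k: int, s: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, k, 1)      # 1x1 conv: c -> k
        self.affinity = nn.Conv2d(k, s * s, 1)    # 1x1 conv: k -> s^2
        self.pool = nn.AdaptiveAvgPool2d(s)       # rho: (h, w) -> (s, s)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, _, h, w = feat.shape
        x_s = self.reduce(feat)                            # (b, k, h, w)
        a = self.affinity(x_s).flatten(2).transpose(1, 2)  # (b, hw, s^2)
        a = a.softmax(dim=-1)                              # assumption
        p = self.pool(x_s).flatten(2).transpose(1, 2)      # (b, s^2, k)
        w_s = torch.bmm(a, p)                              # (b, hw, k), eq. (1)
        w_s = w_s.transpose(1, 2).reshape(b, -1, h, w)     # (b, k, h, w)
        return x_s * w_s                                   # z_s, eq. (2)

class CAP(nn.Module):
    """Concatenate the input feature F with all z_s, per eq. (3)."""

    def __init__(self, in_ch: int, k: int, scales=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            CAPScaleBranch(in_ch, k, s) for s in scales)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return torch.cat([feat] + [b(feat) for b in self.branches], dim=1)
```

The default scales (1, 2, 3, 6) follow the best combination reported in the pyramid-level ablation of Section IV-D.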

C. Edge-Guided Feature Transform Module
In RSIs, overlap between categories occurs due to the imaging technology, leading to false alarms. Furthermore, for tiny targets surrounded by large-scale land covers, the edge is difficult to distinguish because of the irregularity and complexity of boundary shapes. It is therefore valuable to design an efficient decoder structure that precisely maintains the edge features while consistently maintaining the body features.
In CNNs, spatial resolution is sacrificed to increase the receptive field of the output, so upsampling is needed to restore the feature map to the original spatial resolution. Considering that the pixels inside an object are usually similar owing to homogeneity, whereas pixels near the boundary differ owing to possible heterogeneity, an EGFTM is designed to learn edge and body feature representations under supervision. These edge and body features are leveraged to restore the output resolution according to their different constraint effects.
Specifically, as demonstrated in Fig. 6, the edge feature F_e is explicitly extracted from lower encoder layers and the body feature F_b from higher encoder layers by simple convolution operations with supervision [53]. The body feature F_b, which mainly contains high-level semantic information, is used to calculate channel weights through a global average pooling (GAP) operation and thereby aggregate semantically consistent context inside objects. The edge feature F_e, which has larger values at boundaries and smaller values elsewhere, is leveraged to reinforce the edges. The modified feature F_m is calculated as

F_m = ζ_2(ζ_1(F) ⊗ GAP(F_b) ⊕ F_e)  (4)

where ζ_1 and ζ_2 denote two groups of Convolution → BatchNorm → ReLU, ⊗ denotes channel-wise multiplication, ⊕ denotes element-wise addition, and F denotes the refined feature output by the CAP module. The semantic prediction is obtained by upsampling F_m.
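A minimal PyTorch sketch of the EGFTM as read from (4). The sigmoid squashing of the GAP channel weights and the channel sizes are assumptions (the text only specifies GAP), and F_e is assumed to be spatially aligned with the CAP output.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch: int, out_ch: int) -> nn.Sequential:
    """One Convolution -> BatchNorm -> ReLU group (a zeta in eq. (4))."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class EGFTM(nn.Module):
    def __init__(self, feat_ch: int, body_ch: int, out_ch: int):
        super().__init__()
        self.zeta1 = conv_bn_relu(feat_ch, out_ch)
        self.zeta2 = conv_bn_relu(out_ch, out_ch)
        # GAP over F_b gives per-channel weights; the sigmoid gating
        # is an assumption.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(body_ch, out_ch, 1),
            nn.Sigmoid())

    def forward(self, feat: torch.Tensor, f_body: torch.Tensor,
                f_edge: torch.Tensor) -> torch.Tensor:
        # Body-derived channel weights enforce consistency inside
        # homogeneous regions; edge features reinforce the contours.
        weighted = self.zeta1(feat) * self.channel_gate(f_body)
        return self.zeta2(weighted + f_edge)
```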

D. Joint Task Learning
During training, we jointly supervise the body, edge, and final semantic segmentation predictions. We use the standard binary cross-entropy (BCE) loss for edge prediction and the standard cross-entropy (CE) loss for the body and semantic segmentation predictions. The final loss function is

L = λ_1 L_CE(y, ŷ) + λ_2 L_CE(y, ŷ_aux) + λ_3 L_BCE(y_e, ŷ_e)  (5)

where y represents the ground truth label of the final semantic segmentation and y_e represents the real edge label obtained by performing edge detection on y. ŷ represents the final semantic prediction output at the end of the network, ŷ_e represents the edge prediction obtained from the edge feature F_e, and ŷ_aux represents the auxiliary semantic prediction obtained from the fused output of the edge features F_e and body features F_b. Here, λ_1, λ_2, and λ_3 are hyperparameters that control the weighting between the losses: λ_1 weights the segmentation loss, λ_2 the body loss, and λ_3 the edge loss. Because of the serious imbalance between edge pixels and body pixels, a coefficient γ is used inside the boundary prediction loss L_BCE(y_e, ŷ_e).
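The joint loss can be sketched in PyTorch as follows, with λ_1 = 1.0 and λ_2 = λ_3 = 0.4 taken from the parameter tuning in Section IV-F. The function name is hypothetical, and implementing γ as a BCE pos_weight with a value of 10 is an assumption.

```python
import torch
import torch.nn.functional as F

def joint_loss(seg_logits: torch.Tensor, aux_logits: torch.Tensor,
               edge_logits: torch.Tensor, y: torch.Tensor,
               y_edge: torch.Tensor, lam1: float = 1.0,
               lam2: float = 0.4, lam3: float = 0.4,
               gamma: float = 10.0) -> torch.Tensor:
    """Eq. (5): CE on the final and auxiliary (body) predictions
    against y, class-rebalanced BCE on the edge map against y_edge."""
    l_seg = F.cross_entropy(seg_logits, y)      # lambda_1 term
    l_body = F.cross_entropy(aux_logits, y)     # lambda_2 term
    # gamma up-weights the scarce positive (edge) pixels in the BCE;
    # its value here is an assumption, not reported in the text.
    l_edge = F.binary_cross_entropy_with_logits(
        edge_logits, y_edge, pos_weight=torch.tensor(gamma))
    return lam1 * l_seg + lam2 * l_body + lam3 * l_edge
```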

IV. EXPERIMENTS
This section conducts extensive experiments on three challenging RSI semantic segmentation datasets: Tianzhi, Vaihingen, and Potsdam. An ablation study is first conducted to verify the effectiveness of each module of the proposed framework. Then, the proposed framework is compared with several SOTA baselines, and the comparison results demonstrate the advancement of the proposed method. To highlight the effectiveness of the algorithms, the best result in each comparison is shown in bold and the second-best result is underlined.

A. Datasets
The proposed approach is evaluated with experiments on the ISPRS 2-D semantic labeling benchmark and the dataset released for the Tianzhi Cup artificial intelligence challenge.

B. Evaluation Criteria
The OA and mean intersection over union (mIoU) criteria are employed to assess the overall performance of semantic segmentation, and the F1 score is exploited to evaluate the performance of each category. Suppose there are k + 1 categories in total (from 0 to k, where 0 represents Background), and let p_ij denote the number of pixels belonging to category i that are predicted as category j.
OA is a straightforward metric computing the ratio of the number of correctly classified pixels to the total number of pixels:

OA = Σ_{i=0}^{k} p_{ii} / Σ_{i=0}^{k} Σ_{j=0}^{k} p_{ij}  (6)

The intersection over union (IoU) of a category is the ratio of the intersection of the pixels predicted as that category and the ground truth pixels of that category to their union. The mIoU is obtained by averaging the IoU over all categories except Background:

mIoU = (1/k) Σ_{i=1}^{k} p_{ii} / (Σ_{j=0}^{k} p_{ij} + Σ_{j=0}^{k} p_{ji} − p_{ii})  (7)

The F1 score is the harmonic mean of precision and recall:

precision_i = p_{ii} / Σ_{j=0}^{k} p_{ji},  recall_i = p_{ii} / Σ_{j=0}^{k} p_{ij}  (8)

F1_i = 2 · precision_i · recall_i / (precision_i + recall_i)  (9)
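For reference, a NumPy sketch computing these criteria from a confusion matrix p. Excluding class 0 from the mIoU follows the text; the function name and the epsilon guards against empty classes are implementation assumptions.

```python
import numpy as np

def metrics_from_confusion(p: np.ndarray):
    """OA, mIoU, per-class IoU and F1 from a confusion matrix p,
    where p[i, j] counts pixels of class i predicted as class j and
    class 0 (Background) is excluded from the mIoU."""
    p = p.astype(np.float64)
    tp = np.diag(p)
    oa = tp.sum() / p.sum()                            # eq. (6)
    union = p.sum(axis=1) + p.sum(axis=0) - tp
    iou = tp / np.maximum(union, 1e-12)
    miou = iou[1:].mean()                              # eq. (7)
    precision = tp / np.maximum(p.sum(axis=0), 1e-12)  # eq. (8)
    recall = tp / np.maximum(p.sum(axis=1), 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return oa, miou, iou, f1                           # f1 per eq. (9)
```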

C. Implementation Details
All the experiments are carried out with the PyTorch framework. The proposed network is trained using the Adam optimizer with a batch size of 8, betas of (0.9, 0.999), and a weight decay of 5e-4. The learning rate is initialized to 2e-5, and the "poly" policy is adopted to decay it by a factor of (1 − iter/total_iter)^0.9 after every training iteration. The experiments are conducted on a single NVIDIA GTX 2080Ti GPU.
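These settings translate directly into PyTorch as follows. The dummy model, the total iteration count, and the omitted forward/backward pass are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

total_iters = 40_000                 # placeholder: total training iterations
model = nn.Conv2d(3, 6, 1)           # placeholder stand-in for the network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5,
                             betas=(0.9, 0.999), weight_decay=5e-4)
# "poly" policy: scale the base LR by (1 - iter/total_iters)^0.9.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / total_iters) ** 0.9)

for it in range(total_iters):
    optimizer.zero_grad()
    # ... forward/backward on a batch of 8 would go here ...
    optimizer.step()
    scheduler.step()                 # decay after every iteration
```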

D. Ablation Study
1) Effectiveness of Each Module: The ablation study is conducted on both the Vaihingen and Potsdam datasets to evaluate the effectiveness of each component of the proposed method, with the results reported in Table I. For the Vaihingen dataset, ResNet-101 with simple upsampling is first selected as the encoder and decoder of the baseline, which achieves 78.82% mIoU and 90.20% OA. Then, the adaptive optimal multiscale representations are simply concatenated after the ResNet-101 part, denoted as the CAP layer, which obtains 79.19% mIoU and 90.47% OA, a slight enhancement. Furthermore, to verify the effectiveness of the proposed EGFTM, it is inserted before the upsampling layer of the baseline, increasing OA and mIoU to 90.60% and 80.48%, respectively. Finally, by employing both CAP and EGFTM, the OA is enhanced to 90.65%, a 0.45% increment over the baseline, and the mIoU is improved to 80.75%, 1.93% better than the baseline. For the Potsdam dataset, the baseline combined with CAP and EGFTM attains superior semantic segmentation performance, improving OA by 0.16% and mIoU by 0.45%. Similar ablation results are observed on the Cityscapes dataset, where the gains are even more pronounced: by employing CAP and EGFTM, the mIoU is enhanced to 76.36%, a 5.13% increment over the baseline. All these improvements demonstrate that the proposed modules greatly benefit scene parsing. A visual comparison of the proposed modules is shown in Fig. 7.
For a more detailed analysis, the F1 scores of each class under different settings are tabulated in Tables II and III. From Table II, the class "Car," which has a tiny structure, is improved by 4.82%, whereas the large classes "Imp.S.," "Build.," "Low.V.," and "Tree" improve by 0.61%, 0.46%, 0.57%, and 0.09%, respectively. The F1 scores on the Potsdam dataset in Table III show that although the "Imp.S." class decreases slightly, the other classes perform better than the baseline; in particular, the tiny "Car" class obtains an improvement of 0.71%. It can also be observed that CAP obtains more precise multiscale feature representations for each class with a slight accuracy enhancement, whereas EGFTM generates more consistent segmentation inside both large and tiny objects and along boundaries, which is why EGFTM obtains a larger increment than CAP.
2) Robustness of Different Backbones: To show the generalization capability of the proposed modules, a series of experiments is further conducted with different backbone networks; it is worth noting that the presented modules can be easily inserted into various backbones. ResNet-50, ResNet-101, and ResNet-152 are selected as representatives. Due to limited space, the OA, mIoU, and F1 scores of the classes "Build." and "Car" are reported to verify the robustness of the proposed framework; the detailed criterion values of the other large classes, such as "Imp.S.," "Low.V.," and "Tree," were also recorded and lead to observations similar to those for the "Build." class. All the networks are equally pretrained on ImageNet. The comparison results are reported in Tables IV and V. Both modules combined with ResNet-152 achieve the best performance on both datasets, indicating that ResNet-152 has the most capable feature representation. Furthermore, the backbones equipped with the proposed strategies obtain considerably better mIoU, OA, and F1 scores on both the Vaihingen and Potsdam datasets than without the proposed modules, and the increments for tiny objects are larger than those for other classes. These results imply that the EGFTM and CAP modules enhance the boundary features of tiny objects and effectively extract more detailed multiscale context.

3) With Different Pyramid Levels:
The proposed CAP is constructed in a pyramid manner to extract spatial features at multiple scales, and the best representation is adaptively obtained through the attentional weights. Models with different scale rates adopted in CAP are assessed to verify the effectiveness of the adaptive optimal multiscale features. As given in Table VI, harvesting abundant multiscale adaptive attentional features outperforms single-scale features, indicating the effectiveness and necessity of the adaptive attentional pyramid structure. Moreover, a single scale of 2 yields the best OA and mIoU among the single-scale settings, and the combination of scales 1, 2, 3, and 6 achieves the best overall OA and mIoU. These results indicate that adaptively learning appropriate object representations by aggregating multiscale features through different contexts is critical for improving the segmentation accuracy of tiny targets.

E. Comparison With the State of the Art
The proposed method is further compared against several SOTA semantic segmentation baselines on the ISPRS 2-D semantic labeling benchmarks, the Tianzhi dataset, and the Cityscapes dataset. Furthermore, the complexity of the SOTA algorithms and the proposed one with the ResNet-101 backbone is compared. The proposed method is moderate in terms of computational complexity: compared with DeepLabV3+ [10], although the FLOPs of the proposed method are slightly greater, both the number of parameters and the occupied memory are smaller, and the inference time is almost the same.
Tables VII and VIII tabulate the results on the typical ISPRS benchmarks, Vaihingen and Potsdam, respectively, demonstrating that the proposed method achieves the best performance for almost every land-cover class. Specifically, the proposed method obtains significant improvements and outperforms other SOTA methods for tiny objects, such as cars and some small buildings. The OA and mIoU values achieved with the two proposed modules are consistently better than those obtained without them, indicating that the EGFTM and CAP significantly enhance segmentation performance in RSIs, especially for tiny objects. The corresponding segmentation maps are also provided for visual comparison. From Fig. 8, the edges of cars are attained more finely by the proposed method, demonstrating that the proposed modules help improve the segmentation accuracy of tiny objects. All these results also imply that object boundaries are vital to precision in RSI semantic segmentation. Besides, the experimental results on the Tianzhi dataset are reported in Table IX and Fig. 9. For this typical RSI, in which winding, narrow roads are surrounded by large-scale vegetation, the proposed modules clearly improve accuracy, and the segmentation performance on all objects in the Tianzhi dataset obtains a positive lift. In the visual comparison in Fig. 9, the roads detected by the proposed method are the most accurate, with relatively complete and acceptable boundaries compared with the other methods, indicating that the enhanced edges and adaptive multiscale contexts work well in the RSI semantic segmentation task with complicated backgrounds; the proposed EGFTM and CAP modules are particularly effective for tiny-target segmentation. Table X and Fig. 10 show the segmentation results on the Cityscapes dataset, where our method again achieves the best segmentation results, improving the mIoU over the compared algorithms by 3.1%, 0.58%, 1.64%, 3.04%, and 2.96%, respectively.

F. Parameter Tuning
When designing the overall segmentation loss of the network, three parameters, λ_1, λ_2, and λ_3, are used to weight the segmentation, body, and edge losses. To determine the final weight values, the segmentation weight is fixed to 1.0 while the body and edge weights are varied in ablation experiments. Fig. 11 shows the influence of different λ_2/λ_3 values on the final segmentation results: the best results are achieved at a value of 0.4, so a weight of 0.4 is adopted for both the body and edge losses. Table XI tabulates the number of parameters and the FLOPs of each compared algorithm. The proposed method has slightly more parameters than SegNet and PSPNet but still fewer than OCNet, CCNet, and DeepLabV3+, because the network utilizes the extracted edge and body features and their loss computations for backpropagation. At the same time, the CAP module requires many feature matrix operations, so the FLOPs may be slightly higher than those of the other networks. However, compared with the final performance improvement, we consider the computational cost acceptable.

V. CONCLUSION
This article proposes a novel semantic segmentation framework for RSIs, named FGN-CAP, with two flexible and effective modules. The EGFTM leverages edge and body information as a guide to enhance edge contours and internal consistency, explicitly improving the representation of small or tiny objects in a coarse-to-fine manner. The CAP adaptively aggregates multiscale features according to different contexts, enhancing performance on objects of varying scales in RSIs. Both strategies can be applied to any off-the-shelf segmentation network to boost its performance. Thorough experiments show their effectiveness and robustness, achieving SOTA results on three public remote sensing benchmarks.