Class-Guidance Network Based on the Pyramid Vision Transformer for Efficient Semantic Segmentation of High-Resolution Remote Sensing Images

Small interclass differences and large intraclass variations in multicategory semantic segmentation remain problems that the "encoder–decoder" structure of the fully convolutional neural network does not completely solve, leading to imprecise perception of easily confused categories. To address this issue, in this article, we argue that sufficient contextual information can provide more interpretation clues to the model. Additionally, if we can mine the class-specific perceptual information for each semantic class, we can enhance the information belonging to the corresponding class in the decoding process. Therefore, we propose the class-guidance network based on the pyramid vision transformer (PVT). In detail, with the PVT as the encoder network, the following decoding process is composed of three stages. First, we design a receptive field block to expand the receptive field to different degrees using parallel branches and different dilation rates. Second, we put forward a semantic guidance block to utilize the high-level features to guide the channel enhancement of low-level features. Third, we propose the class guidance block to achieve class-aware guidance of adjacent features and refine the segmentation in a progressive manner. According to experimental findings on the Potsdam and Vaihingen datasets, the overall accuracy of the method is 88.91% and 88.87%, respectively.


I. INTRODUCTION
Due to the rapid development of high-resolution remote sensing Earth observation and the continuous improvement of satellite data resolution, robust semantic segmentation of remote sensing (RS) imagery has become essential for Earth observation [1], land use [2], land cover [3], and urban planning [4], [5], and it has a long research history.
In earlier times, traditional machine-learning-based techniques attempted to utilize a small number of handcrafted features (i.e., shape, texture, and color) to segment ground objects from RS imagery. Nevertheless, the complex distribution of various ground objects makes it difficult to maintain robustness and generalization, resulting in poor precision and transferability [6], [7], [8]. Later, owing to their strong feature representation ability, deep learning approaches represented by convolutional neural networks (CNNs) were introduced to the semantic segmentation of RS imagery and have made tremendous progress [9]. Initial CNNs were limited by a fixed input size, and the fully convolutional neural network (FCN) [10] was proposed to successfully address this problem. At the same time, the "encoder-decoder" structure of the FCN has become the mainstream framework of current deep learning methods. In detail, the encoder aims at yielding multilevel features by performing a series of convolution operations, and the decoder concentrates on aggregating the useful information of the encoded features and fusing it to output the final segmentation result. Based on such an "encoder-decoder" framework, numerous networks have emerged in recent years, which usually employ common backbones as the encoder, such as VGG [11], ResNet [12], and DenseNet [13], and adopt various strategies in the decoder to overcome the challenges brought by the huge intraclass variance and small interclass differences. At first, as spectral information alone is insufficient to distinguish ground objects from complex surroundings, multiscale background information was utilized to assist the identification of ground objects [14], and considerable effort has been devoted to efficiently obtaining and fusing multiscale information through multiscale training or feature fusion, which helps solve the issues caused by the various sizes of scene objects. Subsequently, some studies [15], [16] attempted to establish long-distance dependence from a global perspective to access enough global contextual information.
Moreover, due to the limited receptive field of the convolutional operation, many works have been proposed to expand the receptive field and enrich contextual information. HRNet [17] adopted a parallel approach to maintain high-resolution representations and fuse multiple features. U-Net [18] combined deep semantic information with shallow detailed information through skip connections; however, it can only output features from a single scale. Some networks utilize dilated convolution or deconvolution to expand receptive fields at various levels, providing more comprehensive spatial context modeling to improve multiscale feature representation [19], [20], [21], [22]. DeepLabV3+ [23] developed the atrous spatial pyramid pooling (ASPP) module to capture context information at various scales. Using multiple parallel atrous convolution layers with different dilation rates allows the generated feature maps to keep the same resolution as the input, so that each output neuron has a larger receptive field without increasing the number of kernel parameters. To relate local and global information and capture context at more scales, the encoded features with higher level semantics are further processed to obtain information from different receptive fields. PSPNet [24] introduced the pyramid pooling module (PPM) to aggregate information from multiple receptive field sizes and to integrate the global context information of different regions. Good pixel-level predictions are obtained by up-sampling and assigning the aggregated context to each pixel, but the computational efficiency is relatively poor. ABCNet [25] captures rich spatial details and global context information in high-resolution remote sensing images through a bilateral architecture and designs a feature aggregation module to fuse the features obtained by the two paths, but its context path involves complex parameters and computations. MANet [26] employs two layers of atrous convolutions with different dilation rates and global average pooling (GAP) to extract context information at multiple scales in parallel to address the large differences in target sizes in remote sensing images, and embeds a channel attention mechanism to fuse semantic features. The transformer [27], [28], [29] has recently achieved remarkable success in the vision field owing to its ability to capture global contextual information through self-attention. In general, the success of CNNs depends on their two inherent inductive biases, namely translational invariance and local correlation. These properties are usually absent in the vision transformer, which therefore needs a large amount of data to surpass the performance of CNNs. In general, the CNN can obtain more effective local features [30]. However, its limited receptive field makes it challenging to capture global information, whereas the transformer can capture long-distance dependencies. Thus, by combining the CNN and transformer, the network structure may preserve both local and global traits while inheriting the benefits of each. STransFus [31] combines the swin transformer model with a pretrained ResNet34 CNN backbone and uses a segmented model to provide rich coarse- and fine-grained feature representations. ICTNet [32] deploys swin transformer blocks and convolution blocks (CBs) in an interleaved manner in the decoder to extract features, and aggregates the encoded features from the encoder stage.
CCTNet [30] combines local and global context information captured by the CNN and transformer branches to avoid damaging their respective characteristics. Based on the aforementioned, we adopt the pyramid vision transformer (PVT) [33], which introduces a pyramid structure into the transformer framework to generate multiscale feature maps. The PVT still provides a global receptive field, which is better suited for dense prediction tasks such as semantic segmentation. Compared with the ViT [34], [35], which also uses a nonconvolutional model to replace the CNN backbone, the PVT has the advantage of learning high-resolution representations, while also introducing a progressive shrinking pyramid and a spatial-reduction attention layer to reduce computing costs and resource consumption [33].
Different from previous works, we argue that the small interclass variance and large intraclass variance of remote sensing images make misclassification and omission more likely to occur. To address this issue, in this article, we approach it from two perspectives. On the one hand, we believe that remote sensing imagery covers a wide range of ground objects, and sufficient contextual information can provide more interpretation clues to the model. On the other hand, if we can mine the class-specific perceptual information for each semantic class, we can enhance the information belonging to the corresponding class in the decoding process. Therefore, we propose the class-guidance network based on the PVT. In detail, using PVT-v2-b2 as the encoder network and a CNN as the decoder avoids damaging the respective characteristics of the CNN and transformer. The decoding process is composed of three stages. First, referring to the practice in [36], multiscale contextual information is of great importance for enhancing the encoded features. In general, the receptive field of each layer is fixed, which loses some information and loses the ability to distinguish different fields of view, such as the important region near the center. Besides, according to [20], the pixels within the receptive field do not contribute equally to the output of a neural node; more specifically, the pixels near the center of the receptive field contribute more. Based on these observations, we design a receptive field block (RFB) following an idea similar to Inception [37] and ASPP [23]: we utilize a multibranch structure with different kernel sizes and make use of atrous convolution to model the eccentricity of the receptive field. Second, the spatial position and boundary information of ground objects is seriously lost due to rough up-sampling operations, resulting in incomplete structures and rough boundaries. Meanwhile, the feature representation ability of most networks is not enough to deal with easily confused ground objects [16]. However, there are inherent semantic differences between high-level semantic features and low-level visual features, and directly fusing features of different levels will result in representation errors. Given these issues, we propose the semantic guidance block (SGB), which utilizes high-level features to guide the channel enhancement of low-level features and learns an adaptive weight for each channel to determine which channels need more attention. Third, we found that the segmentation results obtained by direct multilayer convolutional operations on high-level features are relatively coarse. We therefore establish a class-aware guidance mechanism that mines class-aware information through a novel class guidance block (CGB); that is, the coarse segmentation results are used to guide the fusion of two adjacent features for a finer segmentation.
In this article, combining the global context of the transformer with the local details of the CNN helps improve the quality of the remote sensing image semantic segmentation model. The contributions of this article are threefold.
1) A multiscale receptive field expansion module is designed to obtain richer context information.
2) The perception of the foreground is improved by the category relation matrix.
3) The CGB guides feature fusion and enhancement layer by layer, outputting more refined segmentation results.

II. METHODOLOGY
As shown in Fig. 1, the overall architecture of our proposed class-guidance network based on the PVT follows the typical encoder-decoder structure. Specifically, we adopt the pretrained PVT-v2-b2 as the encoder network to extract multilevel encoded features from the input remote sensing image. Afterward, we develop a set of novel and powerful decoding strategies to interpret the multilevel encoded features.
The whole decoding process can be divided into the following three stages.
1) In the first stage, to achieve the extraction of multiscale context information, we introduce an RFB to expand the receptive field and enrich the multiscale contextual information of each level of features.
2) In the second stage, we put forward the SGB to take advantage of the highest level semantic information to enhance the other, lower level features.
3) In the third stage, starting from a relatively coarse segmentation result, we propose the CGB to achieve class-aware guidance of adjacent features and output a refined segmentation result. Based on this progressive strategy, we finally obtain the detailed segmentation result.
Table I describes the detailed architecture of the proposed network. The four levels of feature maps with expanded receptive fields are processed by global adaptive pooling to obtain channel response maps, which are then multiplied with and added to the features of the previous level to complete the fusion of high-level and low-level features.
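To make the three-stage decoding flow concrete, the following sketch shows how the four PVT feature levels could pass through the RFB, SGB, and CGB stages. The module interfaces, channel handling, and the `decode` function itself are illustrative assumptions rather than the authors' released implementation; the placeholder callables only check the shape flow.

```python
import torch

def decode(features, rfbs, sgbs, cgbs, coarse_head):
    """features: [f1, f2, f3, f4] from shallow to deep (the four PVT stages)."""
    # Stage 1: expand the receptive field of every feature level (RFB).
    feats = [rfb(f) for rfb, f in zip(rfbs, features)]
    deep = feats[3]
    # Stage 2: the deepest feature guides channel enhancement of shallower ones (SGB).
    guided = [sgb(deep, f) for sgb, f in zip(sgbs, feats[:3])] + [deep]
    # Stage 3: coarse prediction, then class-guided refinement from deep to shallow (CGB).
    pred = coarse_head(deep)                         # Predict1 (coarsest)
    preds, feat = [pred], deep
    for i in (2, 1, 0):
        feat, pred = cgbs[i](feat, guided[i], pred)  # progressive refinement
        preds.append(pred)
    return preds                                     # Predict1 ... Predict4 (finest last)

# Shape-level dry run with placeholder callables standing in for the real modules.
fs = [torch.randn(1, 64, s, s) for s in (64, 32, 16, 8)]
identity = lambda x: x
take_low = lambda deep, low: low
fake_cgb = lambda fh, fl, p: (fl, torch.zeros(1, 6, fl.shape[-2], fl.shape[-1]))
fake_head = lambda f: torch.zeros(1, 6, f.shape[-2], f.shape[-1])
outs = decode(fs, [identity] * 4, [take_low] * 3, [fake_cgb] * 3, fake_head)
print(len(outs), outs[-1].shape)                     # 4 torch.Size([1, 6, 64, 64])
```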

A. Pyramid Vision Transformer (PVT)
The PVT inherits the advantages of both the CNN and transformer and can be used as a direct replacement for CNN backbones. The overall architecture of the PVT is shown in Fig. 2. Taking fine-grained image patches (4 × 4 pixels) as the input, the whole pyramid architecture is divided into four stages to produce feature maps of various scales, and the output resolution is gradually reduced from high to low, from a stride of 4 to a stride of 32. The architecture of every stage is similar and consists of a patch embedding layer and transformer encoder layers [33]. The PVT can not only be trained on dense partitions of an image to achieve high output resolution, but it also uses a progressive shrinking pyramid to reduce the computations on large feature maps, and it adopts spatial-reduction attention (SRA) to further reduce resource consumption when learning high-resolution features. The SRA is calculated as follows [33]:
$$\mathrm{SRA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_0, \ldots, \mathrm{head}_{N_i})W^{O}$$
$$\mathrm{head}_j = \mathrm{Attention}\left(QW_j^{Q},\ \mathrm{SR}(K)W_j^{K},\ \mathrm{SR}(V)W_j^{V}\right)$$
$$\mathrm{SR}(\mathbf{x}) = \mathrm{Norm}\left(\mathrm{Reshape}(\mathbf{x}, R_i)W^{S}\right)$$
where Concat(·) is the concatenation operation, and $W_j^{Q}, W_j^{K}, W_j^{V} \in \mathbb{R}^{C_i \times d_{head}}$ and $W^{O} \in \mathbb{R}^{C_i \times C_i}$ are linear projection parameters. SR(·) reduces the spatial dimensionality of the input sequence (i.e., K or V). The operation Reshape(x, R_i) reshapes the input sequence $\mathbf{x} \in \mathbb{R}^{(H_iW_i) \times C_i}$ into a sequence of size $\frac{H_iW_i}{R_i^2} \times (R_i^2 C_i)$, and $W^{S} \in \mathbb{R}^{(R_i^2 C_i) \times C_i}$ is a linear projection that reduces the dimension of the input sequence to $C_i$. Norm(·) refers to layer normalization.
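As a concrete illustration of SRA, the following PyTorch sketch implements the reduction-by-convolution trick described above. The embedding dimension, number of heads, and reduction ratio are illustrative choices, not the PVT-v2-b2 settings.

```python
# Minimal sketch of spatial-reduction attention (SRA); hyperparameters are illustrative.
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    def __init__(self, dim=64, num_heads=1, sr_ratio=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # SR(.): reduce the spatial size of K and V by sr_ratio in each direction.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                          # N = H * W tokens
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Spatial reduction before computing K and V.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)   # (B, N / sr^2, C)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N / sr^2, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Example: a 64 x 64 feature map with 32 channels and reduction ratio 8.
tokens = torch.randn(2, 64 * 64, 32)
sra = SpatialReductionAttention(dim=32, num_heads=1, sr_ratio=8)
print(sra(tokens, 64, 64).shape)                   # torch.Size([2, 4096, 32])
```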

B. Receptive Field Block (RFB)
As mentioned before, after PVT-v2-b2 achieves the feature extraction, we attempt to introduce sufficient contextual information, and for this purpose we design the RFB. As presented in Fig. 3, the proposed RFB consists of five parallel branches (b_1 to b_5). At the beginning of each branch, a 1 × 1 convolution is used to reduce the original channel dimension of the input features. The first and second branches preserve the original information of the input features, which is used in the final residual structure and in the concatenation, respectively. The last three branches each use three successive convolution operations with different kernel sizes and atrous rates. The reason for using these parallel branches is to build a parallel structure in which the receptive field grows exponentially. In detail, the receptive fields produced by the 1 × 7, 7 × 1 convolutions and by the 1 × 5, 5 × 1 convolutions are four times and twice as large, respectively, as that produced by the 1 × 3, 3 × 1 convolutions. In addition, the 1 × 1 convolutions reduce the original dimension of the input features, and the element-wise addition and concatenation aim to aggregate the original information in the spatial and channel aspects, respectively.
In detail, we first use two convolutions with kernel sizes of 1 × (2k − 3) and (2k − 3) × 1 in turn to capture information from the two orthogonal directions, and we then apply a 3 × 3 convolution with an atrous rate of 2k − 3 (k > 2) to expand the receptive field of the input features. Subsequently, we concatenate the last four branches (b_2, b_3, b_4, b_5) together. Finally, we use a residual connection to add the features of the first branch and the concatenated features, followed by a ReLU activation function to further increase the nonlinearity of the processed features. Under such a series of operations, the input features are enhanced by expanding the receptive field and enriching the multiscale contextual information.
The ASPP module [21], [38] serves as the basis for the parallel multiscale context extraction approach in this article. Many modules use atrous convolution or large convolution kernels to roughly increase the receptive field, and ASPP connects the features of multiple atrous convolutions with different dilation rates to form the final feature representation. Our input images have high resolution; to achieve a large enough receptive field in ASPP, a large enough dilation rate has to be employed. However, as the dilation rate increases (e.g., d > 24), the atrous convolution becomes more and more ineffective [20]. Therefore, our improvement over ASPP is the use of multiple convolutions with varying atrous rates, with each branch being a cascade of convolutions rather than a single convolution operation, which can capture more contextual information. The output results of the branches are then concatenated to create a multiscale feature map.
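A minimal PyTorch sketch of the RFB described above follows. Only the branch pattern (1 × 1 reduction, then 1 × k and k × 1 convolutions followed by a 3 × 3 atrous convolution with rate k, for k = 3, 5, 7, plus the residual branch and the concatenation branch) follows the text; the channel widths, batch normalization, and the 1 × 1 fusion after concatenation are assumptions for illustration.

```python
# Sketch of the receptive field block (RFB); channel widths and fusion conv are assumed.
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, kernel, dilation=1):
    # Padding keeps the spatial size constant for both asymmetric and dilated kernels.
    pad = tuple(dilation * (k - 1) // 2 for k in kernel)
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel, padding=pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class ReceptiveFieldBlock(nn.Module):
    def __init__(self, channels, mid=None):
        super().__init__()
        mid = mid or channels // 4
        self.b1 = conv_bn_relu(channels, channels, (1, 1))       # residual branch
        self.b2 = conv_bn_relu(channels, mid, (1, 1))            # original-information branch
        self.branches = nn.ModuleList()
        for k in (3, 5, 7):                                      # k = 2*stage - 3
            self.branches.append(nn.Sequential(
                conv_bn_relu(channels, mid, (1, 1)),
                conv_bn_relu(mid, mid, (1, k)),
                conv_bn_relu(mid, mid, (k, 1)),
                conv_bn_relu(mid, mid, (3, 3), dilation=k)))     # atrous rate 3 / 5 / 7
        self.fuse = conv_bn_relu(4 * mid, channels, (1, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        outs = [self.b2(x)] + [b(x) for b in self.branches]
        fused = self.fuse(torch.cat(outs, dim=1))                # channel aggregation
        return self.relu(fused + self.b1(x))                     # residual aggregation

x = torch.randn(2, 64, 32, 32)
print(ReceptiveFieldBlock(64)(x).shape)                          # torch.Size([2, 64, 32, 32])
```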

C. Semantic Guidance Block (SGB)
As we all know, the highest level features always contain sufficient semantic information, which is related to abstract information about classes and attributes. In order to suppress the high false alarm rate (general segmentation easily judges some background objects as foreground), this article designs the SGB to strengthen the connection within the foreground and enhance the difference between the foreground and the background. Specifically, we extract the deepest global feature vector in the network and calculate the semantic relationship between the scene and the targets in the current feature so that a relationship matrix is obtained, which can be further used to enhance the foreground features in the image and reduce false positives. In this article, channel weights are generated by GAP over the global features, fully connected layers adaptively learn these weights, and the fused features are finally adjusted by multiplying them with these weights. For each pyramid level, the details of the relationship modeling process are shown in Fig. 3.
For a higher level feature map $F_h \in \mathbb{R}^{C \times H \times W}$, where $H$, $W$, and $C$ represent the height, width, and number of channels, respectively, we first apply global adaptive pooling to $F_h$ to obtain the channel response map $R$ as
$$R = \mathrm{GAP}(F_h).$$
Then, we perform a 1 × 1 convolution to reduce the dimension of $R$ and obtain $R_0$, whose number of channels is half of that of $R$:
$$R_0 = \mathrm{Conv}_{1 \times 1}(R).$$
Then, we apply the sigmoid function to $R_0$ to obtain the probability map $R_1$ as
$$R_1 = \sigma(R_0)$$
so that the weight of each channel ranges from 0 to 1; that is, the weight values of these $C$ channels tell us, through the relationship matrix, which channels are more important and which are less important during the training process. We can improve the recognition ability of foreground features by associating the relevant context of the spatial scene. Then, we multiply $R_1$ with the dimension-reduced $F_l$ to complete the fusion of high-level and low-level features and obtain the final fused feature map $F$ as
$$F = R_1 \otimes F_l.$$
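The following PyTorch sketch follows the SGB computation above (GAP, 1 × 1 convolution halving the channels, sigmoid, and channel-wise multiplication with the dimension-reduced low-level feature). Using a 1 × 1 convolution in place of a fully connected layer and the specific channel counts are assumptions for illustration.

```python
# Sketch of the semantic guidance block (SGB); channel sizes are illustrative.
import torch
import torch.nn as nn

class SemanticGuidanceBlock(nn.Module):
    def __init__(self, high_channels, low_channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                        # R = GAP(F_h)
        self.reduce = nn.Conv2d(high_channels, high_channels // 2, 1)   # R0: halve channels
        self.align = nn.Conv2d(low_channels, high_channels // 2, 1)     # dimension-reduce F_l

    def forward(self, f_high, f_low):
        r = self.gap(f_high)                                      # (B, C_h, 1, 1)
        r1 = torch.sigmoid(self.reduce(r))                        # channel weights in [0, 1]
        f_low = self.align(f_low)                                 # match channel count
        return f_low * r1                                         # channel-wise enhancement

f_high = torch.randn(2, 128, 16, 16)
f_low = torch.randn(2, 64, 32, 32)
print(SemanticGuidanceBlock(128, 64)(f_high, f_low).shape)        # torch.Size([2, 64, 32, 32])
```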

D. Class Guidance Block (CGB)
The dense pyramid network [39] combines pyramid pooling modules with two convolutional layers to achieve multiresolution (high-level and low-level) feature fusion. In addition, the feature pyramid network [40], [41] connects lower level and higher level semantic information by combining up-sampled higher level features with lower level features, integrating additional semantic information into the lower level features, which carry enough spatial information, from the top down. Based on this, we improve the feature maps by using the high spatial details of shallow layers and the strong semantics of deep layers, which is beneficial for restoring object details and for multiscale context modeling. Accordingly, the CGB is designed.
In the proposed CGB, the higher level features are processed by a CB to obtain a coarse segmentation result, which is used to guide the fusion of the two adjacent features so that the following decoding output is more accurate. Specifically, to use the prediction from the former level, we put forward a progressive guiding strategy that takes advantage of the former prediction to help the adjacent feature levels learn a relatively definite spatial relationship. The former prediction already assigns a relatively specific semantic class to each pixel, and we use the sigmoid function to allocate to every pixel the corresponding probability of each class; the errors it contains are largely eliminated by the fusion process, so we can clearly enhance the features and output a better refined segmentation result. According to this strategy, the feature enhancement is guided layer by layer, and four output results are obtained (the fusion of the four features corresponds to four outputs, and the last output is the most precise). The specific steps are as follows.
Let the lower level feature map be $F_l \in \mathbb{R}^{B \times C \times H \times W}$. First, the higher level feature map is up-sampled so that its size also becomes $\mathbb{R}^{B \times C \times H \times W}$, and then we add the high-level and low-level features element by element to obtain $M \in \mathbb{R}^{B \times C \times H \times W}$. Then, we separate the channels of the last coarse segmentation result $P \in \mathbb{R}^{B \times N \times H \times W}$ (with $N$ segmentation categories) to obtain $P_i \in \mathbb{R}^{B \times 1 \times H \times W}$, $i = 1, 2, \ldots, N$.
We perform the sigmoid operation on $P_i$, so that the value at each point of $P_i$ represents the probability that the pixel at this point belongs to category $i$, and then multiply it with $M$ to obtain the enhanced feature of each category:
$$M_i = \sigma(P_i) \otimes M, \quad M_i \in \mathbb{R}^{B \times C \times H \times W},\ i = 1, 2, \ldots, N.$$
We add the enhanced features of all categories element by element to get the feature map $O$ as
$$O = \sum_{i=1}^{N} M_i.$$
Finally, the final result is output through three convolutions.
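The per-class guidance step can be sketched in PyTorch as follows. The convolution block that produces the next prediction and the channel counts are assumptions, while the sigmoid-weighted per-class enhancement and the summation follow the text above.

```python
# Sketch of the class guidance block (CGB) fusion step; head design and channels assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassGuidanceBlock(nn.Module):
    def __init__(self, channels, num_classes):
        super().__init__()
        self.num_classes = num_classes
        # Three convolutions that turn the fused feature into the next prediction.
        self.head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_classes, 1))

    def forward(self, f_high, f_low, prev_pred):
        # Bring the deeper feature and the previous prediction to the low-level size.
        f_high = F.interpolate(f_high, size=f_low.shape[-2:], mode="bilinear",
                               align_corners=False)
        prev_pred = F.interpolate(prev_pred, size=f_low.shape[-2:], mode="bilinear",
                                  align_corners=False)
        m = f_high + f_low                                   # M
        probs = torch.sigmoid(prev_pred)                     # per-class guidance maps
        # M_i = sigmoid(P_i) * M, then O = sum_i M_i.
        o = sum(probs[:, i:i + 1] * m for i in range(self.num_classes))
        return o, self.head(o)                               # refined feature, new prediction

cgb = ClassGuidanceBlock(channels=64, num_classes=6)
f_high = torch.randn(2, 64, 16, 16)
f_low = torch.randn(2, 64, 32, 32)
coarse = torch.randn(2, 6, 16, 16)
feat, pred = cgb(f_high, f_low, coarse)
print(feat.shape, pred.shape)    # torch.Size([2, 64, 32, 32]) torch.Size([2, 6, 32, 32])
```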

III. EXPERIMENTS AND RESULTS

A. Data Description
This article selects the Potsdam and Vaihingen datasets from the official website of the ISPRS as the experimental data. Sample images of the two datasets and their corresponding labels are shown in Fig. 4.

1) Potsdam:
The Potsdam dataset images have a 5-cm spatial resolution. Each image in the dataset is accompanied by an affine transformation file that may be used to break it down into smaller images as needed. The data are provided as three-band GeoTIFF files, and the dataset also offers different image channel combinations in TIFF form, including IR-R-G, R-G-B, and R-G-B-IR.
In this experiment, the image with a boundary label is selected as the label image, including six categories: Impervious surfaces, buildings, low vegetation, trees, cars, and clutter. We use the images 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, and 7_13 as the test set, 2_10 as the validation set, and the rest as the training set.
2) Vaihingen: The Vaihingen dataset has a spatial resolution of 9 cm and includes 33 true orthophotos with an average spatial size of 2494 × 2064 pixels. The ground truth includes the same categories as the ISPRS Potsdam benchmark. We use images 4, 8, 12, 14, 20, 24, 29, 33, and 38 as the test set, image 30 as the validation set, and the rest as the training set. Following the same data-cutting method as for Potsdam, only the red, green, and blue channels are used in the experiment.

B. Experimental Parameter Setting
The parameters should be precisely established before the experiment. We train the networks on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. For each of the two datasets, we train our proposed network separately. Table II describes the data descriptions and training details. The training environment is PyTorch 1.8.1 with CUDA 11.1, using the Adam optimizer. We adjust the learning rate according to the training epoch: the initial learning rate is set to 10^{-4} and is attenuated by a factor of ten every 50 epochs, for a total of 100 epochs, because in our experiments the accuracy saturates at 100 epochs. The multiclass cross entropy is used as the loss function, and an ablation experiment is conducted with two other loss functions commonly used in semantic segmentation (Dice loss and NLLLoss) to prove its effectiveness.
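As a minimal sketch, the optimizer and learning-rate schedule described above can be set up as follows in PyTorch. The placeholder model and dummy batch only illustrate the schedule (Adam, initial learning rate 1e-4, decayed by a factor of 10 every 50 epochs, 100 epochs, multi-class cross entropy), not the actual network or data pipeline.

```python
# Sketch of the training schedule; `model` and the batch are placeholders.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 6, 1)                        # placeholder for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
criterion = nn.CrossEntropyLoss()                 # multi-class cross entropy

for epoch in range(100):
    # for images, labels in train_loader:         # placeholder for the real loader
    images = torch.randn(2, 3, 64, 64)
    labels = torch.randint(0, 6, (2, 64, 64))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()                              # learning rate x0.1 after epochs 50 and 100
```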

C. Evaluation Index
The overall accuracy (OA), mean intersection-over-union (mIoU), and F1-score (F1) are selected as our evaluation metrics to fully assess the semantic segmentation results of this method. In addition, for pixel-level classification tasks, precision and recall are used when the categories are unbalanced:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
OA represents the overall evaluation result over all pixels:
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}.$$
F1-score is a composite measure of recall and precision:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
IoU refers to the ratio between the intersection and the union of the predicted result of a category and the real label, and mIoU refers to the average of the IoU over all categories:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{IoU}_i.$$
These metrics are computed from the numbers of true positives, true negatives, false positives, and false negatives (TP, TN, FP, and FN). TP is the number of foreground pixels correctly predicted as foreground, FP is the number of background pixels incorrectly predicted as foreground, TN is the number of background pixels correctly predicted as background, and FN is the number of foreground pixels incorrectly predicted as background.
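For reference, the following NumPy sketch computes OA, mean F1, and mIoU from a per-class confusion matrix under the definitions above; the toy inputs and the six-class setting are illustrative.

```python
# Sketch: evaluation metrics from a confusion matrix; inputs are illustrative.
import numpy as np

def metrics(pred, label, num_classes):
    cm = np.bincount(num_classes * label.ravel() + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)                               # per-class true positives
    fp = cm.sum(axis=0) - tp                       # predicted as class c but actually not c
    fn = cm.sum(axis=1) - tp                       # class c predicted as something else
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    iou = tp / np.maximum(tp + fp + fn, 1)
    oa = tp.sum() / cm.sum()
    return oa, f1.mean(), iou.mean()

pred = np.random.randint(0, 6, (256, 256))
label = np.random.randint(0, 6, (256, 256))
print(metrics(pred, label, num_classes=6))         # (OA, mean F1, mIoU)
```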

D. Comparing With Existing Works
As mentioned earlier, the PVT-based foreground sensing network model we designed is a segmentation network that combines a PVT encoder with a traditional convolutional decoder. The purpose is to enrich the global context information with long-term dependencies and local patterns and thus improve segmentation accuracy. Therefore, the segmentation accuracy is evaluated numerically against existing works, including UNet, PSPNet, DeepLabV3+, HRNet, and ABCNet.
1) Potsdam Data Experiment Results: The visual comparison results on the Potsdam dataset are shown in Fig. 5, which presents the visualization results after clipping and before clipping, respectively.
It is clear that PSPNet produces low-precision segmentation results and is not robust enough to preserve delicate foreground edges, which leads to irregular shapes and fuzzy boundaries of the extracted buildings. The method used in this article produces straighter building edge lines and more evident right angles, and nearly all of the predicted areas match the reference label. Due to the transformer's capacity to model global relationships, the PVT can segment large objects (such as buildings) more accurately than the CNN models. In addition, the other methods exhibit severely decreased performance for low vegetation and trees. In this article, the weight matrix is obtained by calculating the correlation, and the feature values are adaptively enhanced or attenuated to make the pixel-level representation of classes more accurate, so that the small interclass differences between trees and low vegetation can be distinguished. The proposed method shows smoother results in these challenging categories. We calculated the IoU for each category and the averages of the other metrics for each of the examined models. Table III summarizes the findings.
It can be seen from the results that the PVT-based foreground sensing network proposed in this article ranks first in average F1, OA, and mIoU, obtaining 85.79% in average F1, 88.91% in OA, and 77.32% in mIoU. UNet is the best of all networks using a CNN as the backbone; compared with it, our method increases average F1, OA, and mIoU by 0.95%, 1.33%, and 1.49%, respectively. Although the five compared techniques take feature fusion into account, they do not consider the weights of feature fusion, whereas our proposed method learns the weights of the fused features. As a result, our method can categorize data more accurately, eliminate false positives, and more effectively address the issue of mistaken perception of easily confused categories. For instance, low vegetation and tall trees are easy to classify incorrectly, but our approach performs better in these two categories, with IoU rising by 1.95% and 2.76%, respectively. These results prove the effectiveness of our decoder design. In addition, compared with other categories, the car category is a small target and the impervious surface category is a large target; the IoU values for these two categories increase by 0.46% and 2.28%, respectively, which indicates that the proposed multiscale context extraction based on the PVT can handle the excessive disparity in the size and shape of objects and obtain greater benefits on large-scale datasets. As a result, the transformer-based network offers competitive accuracy in semantic segmentation over most ResNet-based approaches. In order to verify the whole decoding process, we also evaluated the accuracy of Predict1 to Predict4, and the results are shown in Table IV. The four levels of feature fusion outputs are obtained through our CGB, and the last output result is the most detailed. Compared with the first coarse segmentation result, mIoU increases by 4.58%, which proves that the encoder-decoder structure designed in this article can extract features more accurately and fuse them more effectively, thus improving the segmentation performance.

2) Vaihingen Data Experiment Results:
The second group of experiments was conducted on the Vaihingen dataset. The visual comparison results are shown in Fig. 6.
It is observed that the cars in the segmentation results of the other methods are stuck together and incomplete, and the identified car boundary information is not accurate enough. Our method can accurately delineate complete shapes and distinguish small ground objects, such as cars, without adhesion. Additionally, several missing and incorrectly extracted areas can be seen in the results produced by HRNet, DeepLabV3+, and PSPNet, and there is noticeable distortion around the building edges. UNet and ABCNet reduce the occurrence of incorrect and missing segmentation. The comparison demonstrates that our method extracts smaller buildings more completely, with details closer to the reference label and smoother building outlines. For all comparison models, we calculated the average value of F1 and the other indicators for each category. The results are summarized in Table V.
The findings show that our method's OA on the Vaihingen dataset is 88.87%, the average F1 is 82.05%, and the mIoU is 71.79%, which are 1.48%, 3.5%, and 3.73% higher than those of its nearest competitor, UNet, respectively. Although the data volume of Vaihingen is smaller than that of Potsdam, our method still achieves better performance. Particularly for the car category, IoU increases by 3.59%. Because cars make up a small proportion of the pixels in the training images and are occluded by trees and buildings, they are easily misclassified or missed. The network model we propose uses the PVT to extract features at different scales, and the decoder adaptively fuses semantic information at different levels. Therefore, even if the foreground objects occupy a small area in the image, they can be recovered and combined into effective features for accurate segmentation. In the experimental results, despite the uneven distribution of categories, the average precision and recall rise across all categories by 0.79% and 3.63%, respectively. The numerical and visualization results confirm the effectiveness of our method. In addition, we verified the accuracy of Predict1 to Predict4, as shown in Table VI. Compared with the result of the first coarse segmentation, the accuracy of the last class-guided output increases by 16.07%. This proves that our proposed network architecture shows a better segmentation performance than the other five models.

E. Computation and Parameters Efficiency
We report the training and testing running times on both datasets, as well as the number of parameters of the different models, in Table VII.

F. Ablation Experiment
PVT is selected as the encoder in this article, and the U-shaped structure of UNet is used as the baseline in the decoding part, to which the three components RFB, SGB, and CGB are added. The ablation experiments were evaluated on the ISPRS Potsdam and Vaihingen datasets and quantitatively examined to prove the efficacy of the three-stage decoder designed in this study. Table VIII displays the results of the ablation experiment.
1) Baseline: We choose PVT as the encoder for feature extraction and input the output four-level feature maps to the decoder. The decoding part uses the U-shaped structure of UNet as the "baseline." The baseline serves as a standard to evaluate the performance of the network components.
2) Baseline+RFB: In the decoder, the RFB is designed to capture global context information, process the features of each level through parallel branches, and obtain feature maps of different scales. Hence, this simple variant is more capable of capturing multiscale context information. Compared with the baseline, the performance of "Baseline+RFB" demonstrates the effectiveness of the RFB.
3) Baseline+SGB and Baseline+RFB+SGB: Rich spatial information is essential for semantic segmentation; hence, it is necessary to create feature maps with multiscale and high-level semantic information. Aiming to utilize the highest level features to provide semantic guidance to the other, lower level features, we propose the SGB. The table shows that "Baseline+SGB" is 0.76% and 0.25% higher in mIoU than "Baseline," while "Baseline+RFB+SGB" is 0.67% and 0.43% higher than "Baseline+RFB," demonstrating the efficacy of the SGB for feature fusion.

4) Baseline+CGB and Baseline+RFB+CGB:
The fusion of the four feature levels is achieved by the CGB through layer-by-layer feature augmentation, element-by-element addition of the features of each category, and fusion of context and spatial information. In terms of mIoU, "Baseline+CGB" is 1.79% and 0.77% higher than "Baseline," while "Baseline+RFB+CGB" is 0.67% and 0.60% higher than "Baseline+RFB," proving the effectiveness of this staged fusion approach.

5) Baseline+RFB+SGB+CGB:
Combining the three components of the decoder design achieves the highest accuracy. Compared with "Baseline," the segmentation performance of "Baseline+RFB+SGB+CGB" improves by 3.02% and 1.23% in mIoU, which proves that this article provides the best fusion scheme.
In addition, we compare the performance of different loss functions through ablation experiments on the Vaihingen dataset to demonstrate the reliability of the multiclass cross entropy used as the loss function in this article. The numerical and visualization results are shown in Table IX and Fig. 7. The multiclass cross entropy used in this article yields the best segmentation effect and accuracy for most categories.

IV. CONCLUSION
The advantages of multiscale modeling in terms of long-distance visual dependence are shown by introducing the PVT as the encoder, which serves as a pure transformer backbone designed for pixel-level dense prediction tasks. With limited computing and storage resources, the PVT can flexibly learn multiscale, high-resolution, fine-grained features; comparing them with the feature maps extracted by the popular ResNet proves that it has a more powerful feature expression ability than the CNN. It also helps to distinguish between geographical objects that are easily confused.
We developed a special decoding and segmentation method, the class-guidance network, utilizing the CNN, and designed three modules to recover features with high accuracy and efficiency. We use correlation to model the relationship between pixels in various scenes when obtaining accurate multiscale features. Then, we establish a class-aware guidance mechanism to mine the class-aware information and achieve refined segmentation by a progressive approach. Ablation studies prove the effectiveness of each module.
This research is limited to studying how to improve the segmentation accuracy of the network model; the computational complexity of the PVT for high-resolution input is still high. In the future, we encourage further research on the deep integration of the transformer and CNN. Based on this work, we will explore how to reduce computational complexity and obtain efficient segmentation methods.

Shuang Du is currently working toward the master's degree in surveying and mapping science and technology with the School of Transportation and Geomatics Engineering, Shenyang Jianzhu University, Shenyang, China.
Her research interests include computer vision, image processing, deep learning, and the method and application of semantic segmentation of high-resolution remote sensing images.
Maohua Liu received the Ph.D. degree in agricultural resources and environment from Shenyang Jianzhu University, Shenyang, China, in 2021.
He is currently an Associate Professor with the School of Transportation and Geomatics Engineering, Shenyang Jianzhu University, where he is mainly involved in teaching and scientific research in the fields of GIS, remote sensing, and precision engineering surveying.
Dr. Liu has participated in one national fund project, and presided over and participated in more than 20 provincial and municipal level projects.