DSViT: Dynamically Scalable Vision Transformer for Remote Sensing Image Segmentation and Classification

The relationship between foreground targets and the background in remote sensing images is very complex, and remote sensing vision tasks face problems such as complex targets and unbalanced categories. These problems leave existing modeling methods with room for improvement. Therefore, this article proposes a dynamically scalable attention model that combines convolutional features and Transformer features. It can dynamically select the model depth according to the size of the input image, which alleviates both the insufficient global-information extraction of a pure convolutional model and the computational-overhead limitation of a pure Transformer model. We validated the model on two public remote sensing image classification datasets and two remote sensing image segmentation datasets. The accuracy and mean pixel accuracy (mPA) of the proposed method reach 96.16% and 93.44%, respectively, on the University of California (UC) Merced classification dataset; compared with recent work, this is a net improvement of 5.0% and 4.82% over the pyramid vision transformer (PVT) model. On the Potsdam segmentation dataset, the accuracy and F1 of the transformer and CNN hybrid neural network (TCHNN) model are 91.5% and 92.86%, respectively; our method improves on these by 0.64% and 1.0%, and it also achieves the best results on the other two datasets.

Krizhevsky et al. [1] trained a deep convolutional neural network (AlexNet) with a large number of parameters on ImageNet1K to classify 1.2 million images covering 1000 categories, achieving the lowest error rate of that year. Ronneberger et al. [4] proposed the famous U_Net framework based on the concept of the fully convolutional network. The encoding-decoding structure of this framework learns from images in an end-to-end manner and can accurately and efficiently segment cell images; U_Net has become a mainstream baseline in image segmentation tasks. The you only look once (YOLO) [7] series, a one-stage target detection framework based on deep learning, has become the most popular detection approach at present. It casts object detection as a regression problem over bounding boxes, which allows target objects to be detected quickly and effectively. Another popular detection framework is the regions with CNN (RCNN) [8] series, a two-stage framework that uses a CNN to extract features from each proposal region and lays the foundation for the later Fast-RCNN [9] framework. Xu et al. [10] proposed a generative adversarial network called dynamic resblock generative adversarial network (DRB-GAN) to transfer the artistic style of one image to another, showing excellent visual quality.
Transformer [15] was first applied in the field of NLP, where it overcomes the limitations of the long short-term memory (LSTM) and gated recurrent unit, which cannot be trained in parallel and require large amounts of memory. Owing to these unique advantages, the Transformer has not only become the main framework in NLP but, driven by a large number of researchers, has also become increasingly widely used in the field of computer vision (CV).
Bidirectional encoder representations from transformers (BERT) [16] improves on the Transformer model: it is pretrained on unlabeled text, and only the output layer is fine-tuned to adapt it to other NLP tasks. Inspired by BERT, Brown et al. [17] proposed generative pre-trained transformer-3 (GPT-3), a Transformer model with a very large number of parameters (175 billion). Using only the pretrained model, without further training or fine-tuning, GPT-3 shows strong capabilities on the midstream and downstream tasks of NLP. Dosovitskiy et al. [18] first divided the image into patches of the same size, then serialized these patches and fed them into a pure Transformer model; after pretraining on a large number of datasets, it achieves better results than CNNs in image classification tasks. Enze et al. [19] proposed a semantic segmentation framework called SegFormer, which unifies Transformers and multilayer perceptrons for image segmentation; the framework was verified on multiple datasets, demonstrating its efficiency and accuracy.
A remote sensing image contains many targets: targets of the same kind are densely arranged, their sizes vary greatly, their colors and textures vary greatly, and most targets are very small. Because of the restriction on kernel size, convolutional neural networks at the present stage generally use small kernels, so they perform poorly at capturing global information. Although stacked pooling operations can enlarge the receptive field to capture global information, this process loses much information and the information interaction is insufficient. The Transformer obtains long-sequence global interaction information by computing global attention, but this operation requires a large amount of computation, and the cost grows quadratically in midstream and downstream image-processing tasks. In addition, the pure Transformer is sensitive to its parameters and requires pretraining with massive data.
We analyze the above problems and propose a dynamically scalable visual Transformer that combines the complementary advantages of the CNN and the Transformer model and alleviates the problems of using either one alone. The main contributions of this article are as follows.
1) We propose a dynamic and scalable visual Transformer framework that combines the locality of CNN feature extraction with the global context modeling of the Transformer.
2) The Transformer obtains global information through self-attention. We extract the self-attention mechanism, integrate it into the convolution operation, and design a dynamically scalable attention module (DSA) that can meet the midstream and downstream tasks of image processing and handle large-scale images.
3) We evaluate the proposed DSA on different remote sensing image processing tasks, including image classification and semantic segmentation. Compared with state-of-the-art methods and models, our proposed DSA achieves state-of-the-art results on both tasks.

A. CNN for Remote Sensing Image
Remote sensing image analysis is of great significance in Earth observation, urban planning, and environmental protection, but manually handling these remote sensing images is cumbersome and complicated. The CNN can effectively extract the local information of the image by exploiting the translation invariance of the convolution and pooling kernels while preserving the spatial semantic features of the image. As a consequence, using CNNs to process remote sensing images can effectively help staff analyze them.
Penatti et al. [20] applied convolutional neural networks to aerial and remote sensing image classification tasks and obtained the best classification results at that time. Multimodal or multisource remote sensing images allow the feature information of ground objects to be analyzed more comprehensively. Danfeng et al. [21] designed a general multimodal deep learning framework; through different fusion strategies, image data of multiple modalities are fed into the framework, which effectively alleviates the bottleneck of inaccurate fine classification of remote sensing images in complex scenes. Xiaodong et al. [22] designed a dual-branch CNN architecture that fuses the features of hyperspectral images and light detection and ranging (LiDAR) data and shows excellent performance on multisource datasets. Zhiyong et al. [23] designed a new deep learning architecture that combines a spatial-spectral attention mechanism and a multiscale dilated convolution module, which captures more detailed features in remote sensing image change detection and noticeably improves detection accuracy. Maggiori et al. [24] addressed fine-grained classification of remote sensing images with a convolutional neural network and a multiscale neuron module, effectively alleviating the tradeoff between recognition and accurate positioning. Sharma et al. [25] analyzed the spatial relationship between a single pixel of a remote sensing image and its neighborhood and proposed a deep learning framework based on spatial neighborhood patches, which can effectively classify medium-resolution remote sensing images. Starting from a mathematical analysis of parameter optimization, Yang et al. [26] designed a network called HPS_Net, which can adjust the relationship between feature maps and pixel path selection and effectively segments ground objects. Because remote sensing images have complex features, a relatively large model is usually required to capture them, so designing a small model that still achieves good results is also a research direction.

B. Transformer for Remote Sensing Image
The Transformer models long-sequence global context interaction through the self-attention mechanism. In the processing of remote sensing images, the acquisition of global information is crucial. However, the computational overhead of self-attention scales quadratically with the sequence length. For large-scale data such as images, computing self-attention is expensive, so designing a reasonable method to reduce the image scale is very meaningful.
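For concreteness, this is the standard self-attention formulation (not specific to this article): for a sequence of n tokens of dimension d,

```latex
\mathrm{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad Q,K,V\in\mathbb{R}^{n\times d},
```

where the QK^T product costs O(n² · d) time and O(n²) memory. An H × W image flattened into n = HW tokens therefore makes this term grow rapidly with resolution, which is why reducing the number of tokens before attention is essential.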
Limiting the scope over which self-attention acts can effectively reduce its computational complexity. Swin Transformer [27] restricts self-attention to shifted, nonoverlapping windows (16 × 16), greatly reducing the computational cost while still obtaining global information about the image. Lei et al. [28] proposed the wide-context network, which obtains detailed features of high-resolution remote sensing images by fusing CNN and context branches in order to accurately classify diverse ground objects. Yongtao et al. [29] designed a novel cross-context, cross-scale architecture (C2-CapsViT), which effectively fuses global and local semantic features and achieves state-of-the-art performance in remote sensing image scene classification. DynamicViT [30] learns a dynamic sequence sparsification strategy that reduces the computational complexity while maintaining global interaction. Zhengzhong et al. [31] designed a scalable attention model that realizes the interaction of global and local image information with linear complexity; it can also be integrated with convolution and performs well on downstream CV tasks. The combination of the Transformer and the generative adversarial network (GAN) [32] has also achieved excellent results in the vision field.
Convolutional neural networks process images with translation-invariant operators, while the Transformer's powerful global modeling capability made it a triumph in NLP and then let it shine in computer vision. However, the limited kernel size of convolutional neural networks leads to insufficient extraction of global feature information, and the Transformer cannot handle large images because of its computational overhead and input-length limits. Fusing these two models can achieve complementary advantages. The works above support this conclusion and also provide the theoretical basis and foundation for the method in this article.

A. Dynamically Scalable Attention Block
Experimental results show that existing deep learning algorithms are insufficient for extracting complex features, and there is still room for improvement. Therefore, this article effectively integrates the convolutional network and the Transformer, and the designed network model can better extract complex features. In addition, a defect of the Transformer is that it requires very large computational power to process large images; in practice small inputs are used, which forces the original image to be cut into smaller image blocks and loses global information. Moreover, for large images, the feature extraction ability of fixed-depth models such as ResNet50 [2] and ResNet110 is insufficient, and the difference in their feature extraction ability on large images is obvious.
The convolutional neural network can efficiently extract the local information of the input features through its filters, and the Transformer can model the global context information of the input sequence through the self-attention mechanism. Therefore, we propose a dynamically scalable feature extraction module, the dynamically scalable attention block (DSA_Block), that integrates convolution operations and the self-attention mechanism, as shown in Fig. 1. The advantages of this module are as follows. Convolution operations extract the local contextual information of the input features. Since the self-attention mechanism is very computationally intensive, the size of the input features is reduced through a dynamically scalable pooling kernel, and then the self-attention weight of the input features is calculated. The self-attention weights are fused with the input features, so the output features carry not only the local information of the input features but also global information.
We connect the CNN and Transformer components sequentially in order to reduce, through dimensionality reduction, the computational cost of applying the Transformer model to the entire feature map. The convolution operation can effectively extract local contextual features while also reducing the dimensionality of the feature map, thus speeding up training. On the other hand, the weights generated by the Transformer can highlight the relevant features in the convolutional feature map and establish long-range dependencies between features, enabling subsequent layers to extract meaningful information. This allows the model to better capture the complex relationships between different parts of the input data.
First, the input feature of DSA_Block, x ∈ R^{H×W×C}, is subjected to convolution operations to extract local information. After x undergoes two convolution operations, the output feature is out_c ∈ R^{H×W×N}:

out_c = R(nor(conv_{3×3}(R(nor(conv_{3×3}(x))))))    (1)

where conv_{3×3}(·) represents a convolution with a 3 × 3 kernel, nor(·) represents batch normalization, and R(·) represents the rectified linear unit (ReLU) function. After the convolution operation, the feature is first scaled down by the factor s, which further reduces its dimension and the computational cost of self-attention. The scale reduction factor is dynamically scalable and is determined by the downsampling scale k of the input-image encoding process and the number L of DSA_Blocks used: if fewer encoding layers are used (L is smaller), a relatively large downsampling scale is used (k is large); if more encoding layers are used (L is larger), a smaller downsampling scale is used (k is smaller). The value of s also depends on the depth position of the DSA_Block. Assuming that the depth position of the DSA_Block is i and the downsampling scale is k, the scale reduction factor is s = k^{L−i+1}. The value of k is determined jointly by the size of the input image, the depth of the DSA module, and the size of the lowest-level feature map. This lets the network adapt more flexibly to input images of different sizes while controlling the computational cost and model size, thereby improving the efficiency and generalization ability of the model. We compute k from m × k^{L−1} = n, where n is the size of the input image, L is the depth of the DSA module, and m is the size of the lowest-level feature map. In this article, the input image size is 256 × 256, the depth of the DSA module is 5, and the size of the lowest-level feature map is 16 × 16, so we choose k = 2. Reducing the size of the feature map reduces the number of pixels that must be processed, and downsampling extracts higher-level features by merging information from adjacent pixels.
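As a quick numerical check of these choices, using only the values stated above together with the per-layer reduction factors listed in the classification model below:

```latex
m \times k^{L-1} = n \;\Rightarrow\; 16 \times k^{4} = 256 \;\Rightarrow\; k = 2,
\qquad
s_i = k^{\,L-i+1} = 32,\,16,\,8,\,4,\,2 \quad (i = 1,\dots,5).
```

Under this configuration, the feature map at depth i has side n / k^{i−1}, so dividing it by s_i always yields an 8 × 8 map (64 tokens) before self-attention is computed; the number of attention tokens is therefore the same at every layer.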
For the output feature out_c of the convolution operation, the size of out_c is reduced according to the scale reduction factor, giving out_de ∈ R^{(H/s)×(W/s)×N}; that is, the spatial size is reduced by a factor of s². Here out_de denotes the downsampled version of the convolution output out_c, and it is used to compute the self-attention weight. First, out_de is flattened:

out_f = flatten(out_de)    (2)

where flatten(·) is a flattening operation that flattens out_de into out_f ∈ R^{(WH/s²)×N}. Next, layer normalization is applied to out_f, producing out_bl ∈ R^{(WH/s²)×N}. The self-attention mechanism realizes the dynamic aggregation of information through the interaction between queries (Q) and key (K)-value (V) pairs, which are obtained from out_bl through the linear mapping function Linear(·). On this basis, the similarity A_i between Q_i and K_i is calculated, and V_i is weighted according to A_i: the intermediate product Temp ∈ R^{N×N} of Q and K^T is passed through a softmax operation to output the similarity A ∈ R^{N×N}, and V is then weighted by A to output the self-attention weight S_A. Finally, S_A is restored to the spatial resolution of out_c and fused with it, where ⊗ represents the multiplication of corresponding position elements, giving the block output out ∈ R^{H×W×N}.
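To make the flow of DSA_Block concrete (convolution branch → scaled-down self-attention → element-wise fusion), the following is a minimal PyTorch sketch under several assumptions that the description above does not fix: single-head attention over the pooled spatial tokens (the stated N × N similarity might instead indicate channel-wise attention), average pooling for the scale reduction, separate Linear projections for Q, K, and V, and bilinear upsampling before the fusion. It is an illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSABlock(nn.Module):
    """Sketch of a DSA_Block: a local convolution branch plus a pooled,
    single-head self-attention branch whose output re-weights the conv
    features. Projection layout, pooling type, and upsampling mode are
    assumptions, not the authors' exact design."""

    def __init__(self, in_ch, out_ch, reduction):
        super().__init__()
        self.reduction = reduction                      # scale reduction factor s
        self.conv = nn.Sequential(                      # two 3x3 conv + BN + ReLU -> out_c
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.norm = nn.LayerNorm(out_ch)
        self.q = nn.Linear(out_ch, out_ch)
        self.k = nn.Linear(out_ch, out_ch)
        self.v = nn.Linear(out_ch, out_ch)

    def forward(self, x):
        out_c = self.conv(x)                            # local features, H x W
        b, c, h, w = out_c.shape
        out_de = F.avg_pool2d(out_c, self.reduction)    # reduce to (H/s) x (W/s)
        out_f = out_de.flatten(2).transpose(1, 2)       # (B, HW/s^2, N) tokens
        out_bl = self.norm(out_f)                       # layer normalization
        q, k, v = self.q(out_bl), self.k(out_bl), self.v(out_bl)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        s_a = attn @ v                                  # self-attention weight S_A
        s_a = s_a.transpose(1, 2).reshape(b, c, h // self.reduction, w // self.reduction)
        s_a = F.interpolate(s_a, size=(h, w), mode="bilinear", align_corners=False)
        return out_c * s_a                              # out = out_c (x) upsampled S_A
```

For a 256 × 256 input with s = 32 (Layer1 of the classification network described in the next subsection), the attention operates on an 8 × 8 grid of 64 tokens.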

B. Classification Model
The effectiveness of the method in this article is first validated on the image classification task. The network structure shown in Fig. 2 is designed for remote sensing image classification. In this architecture, we use five DSA_Blocks, then two fully connected layers to reduce the dimension to the number of categories, and finally classify the input image through a softmax operation. The input image size in this article is 256 × 256, and the downsampling pooling kernel is 2 × 2 with a stride of 2. In Layer1, the input channel is 3, the output channel is 32, and the reduction factor is s = 32. In Layer2, the input channel is 32, the output channel is 64, and the reduction factor is s = 16. In Layer3, the input channel is 64, the output channel is 128, and the reduction factor is s = 8. In Layer4, the input channel is 128, the output channel is 256, and the reduction factor is s = 4. In Layer5, the input channel is 256, the output channel is 512, and the reduction factor is s = 2. From the change in the reduction factor across layers, we can see that when the feature map is relatively large, the reduction factor is also relatively large.
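For reference, the layer configuration just described can be written down as follows. This sketch reuses the hypothetical DSABlock above, places the 2 × 2, stride-2 pooling between layers so that the lowest-level feature map is 16 × 16, and uses assumed sizes for the two fully connected layers; it is not the authors' released code.

```python
import torch.nn as nn

# (in_channels, out_channels, reduction factor s) for Layer1..Layer5
layer_cfg = [(3, 32, 32), (32, 64, 16), (64, 128, 8), (128, 256, 4), (256, 512, 2)]

class DSViTClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        blocks = []
        for idx, (in_ch, out_ch, s) in enumerate(layer_cfg):
            blocks.append(DSABlock(in_ch, out_ch, reduction=s))
            if idx < len(layer_cfg) - 1:
                blocks.append(nn.MaxPool2d(2, 2))       # 2x2 downsampling, stride 2
        self.backbone = nn.Sequential(*blocks)
        self.head = nn.Sequential(                      # two FC layers down to the class count
            nn.Flatten(), nn.Linear(512 * 16 * 16, 512), nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):                               # x: (B, 3, 256, 256)
        return self.head(self.backbone(x))              # softmax applied in the loss / at inference
```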
The purpose of this is to reduce the amount of computation in self-attention. At the same time, when the feature map is downsampled by different reduction factors, the receptive field differs, i.e., each attention position represents a different region, so the global interactive information contained in the self-attention differs as well. From the large receptive field of the first layer to the small receptive field of the last layer, the range represented by the self-attention weight changes from large to small, from coarse to refined, so this pyramid-like arrangement makes the extracted global information more comprehensive.
The convolution operation at the front end of DSA_Block captures the local information and details of the input features. Through DSA_Block, local and global information are effectively fused to improve classification accuracy. The classification experiments below also fully verify the effectiveness of the method in this article.

C. Segmentation Model
In addition, we verify the method in this article on the image segmentation task and construct a segmentation network for semantic segmentation of remote sensing images, as shown in Fig. 3. The framework adopts an end-to-end encoding-decoding structure. The input image size of the segmentation network is 512 × 512, and the downsampling pooling kernel size is 2 × 2 with a stride of 2. In the encoding stage, similar to the classification structure above, five DSA_Block layers are used. The output channels and reduction factors of each module are the same as in the classification network, and each module likewise fuses global and local context information. A 3 × 3 convolution operation is added at the lowest layer to further extract features.
In the decoding stage, we adopt the visual geometry group 16 (VGG16) [33] structure, and two 3 × 3 convolutions are used in each of the first to fifth decoding layers. The specific structure is shown in Table I: "×2" indicates that there are two convolution blocks of this type, and "k = 2" indicates that the kernel used in the upsampling process has a size of 2. We replace the max-pooling downsampling layers of VGG16 with deconvolution (transposed convolution) upsampling layers.
At the same time, skip connections add the shallow information captured by DSA_Block to the deep information of the corresponding decoding layer to further deepen feature fusion, which alleviates the information loss caused by repeated downsampling and also helps prevent vanishing gradients. Finally, the corresponding semantic segmentation map is output. The remote sensing image segmentation results below strongly demonstrate that the method in this article is very effective at fusing global and local information: while obtaining global information, it can also extract local details such as object boundaries. This further confirms that fusing the self-attention mechanism of the Transformer into the CNN is reasonable, and that it integrates the strengths of CNN and Transformer to obtain a basic module that is superior to a pure convolution operation or a pure Transformer structure.
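A single decoding stage of this encoder-decoder can be sketched as below, assuming transposed-convolution upsampling with kernel size 2 (the "k = 2" of Table I), an additive skip connection as described, and placeholder channel widths rather than the exact values of Table I.

```python
import torch.nn as nn

class DecoderStage(nn.Module):
    """One decoding stage: transposed-conv upsampling (k = 2), an additive skip
    connection from the corresponding DSA_Block, and two 3x3 VGG16-style conv
    blocks. Channel widths are placeholders, not the paper's Table I values."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)       # upsample the deep decoder features
        x = x + skip         # add the shallow encoder features (skip connection)
        return self.conv(x)
```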

A. Datasets
In order to confirm the effectiveness of the method in this article, we performed ablation experiments on two classification datasets (UC Merced [34] and WHU-RS19 [35]) and two segmentation datasets (Potsdam [36] and LoveDA [37]). The details of each dataset are shown in Table II. The original image size of the Potsdam dataset is 6000 × 6000; the images are too large to be input into the model, so they are cropped, and the table shows the cropped size. The image sizes of the other datasets are the official standard sizes.

B. Setting and Evaluation Metrics
In the classification task, the input image size is 256 × 256, the batch size is 10, the initial learning rate is 0.001, the learning rate decays with the epoch, the number of epochs is 400, the stochastic gradient descent (SGD) optimizer is used with a momentum of 0.9, and the weight decay is 5 × 10^{-5}. The evaluation metrics are accuracy and mPA.
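A minimal sketch of this optimizer setup is given below; the exact epoch-wise decay rule is not stated, so the polynomial decay used here is an assumption, and DSViTClassifier refers to the hypothetical classifier sketch above.

```python
import torch

model = DSViTClassifier(num_classes=21)            # e.g., 21 classes for UC Merced
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=5e-5)
criterion = torch.nn.CrossEntropyLoss()            # basic cross-entropy loss, as used for both tasks

epochs = 400
for epoch in range(epochs):
    lr = 0.001 * (1 - epoch / epochs) ** 0.9       # epoch-wise decay (assumed polynomial schedule)
    for group in optimizer.param_groups:
        group["lr"] = lr
    # ... iterate over batches of size 10, compute criterion(model(images), labels),
    #     backpropagate, and step the optimizer ...
```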
In the segmentation task, the input image size is 512 × 512, the batch size is 4, the initial learning rate is 0.01, the learning rate decays with the epoch, the number of epochs is 400, the SGD optimizer is used with a momentum of 0.9, and the weight decay is 10^{-4}. The evaluation metrics are accuracy, the F-measure (F1) score, the Kappa coefficient, and the mean intersection over union (mIoU).
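For completeness, and assuming the standard definitions of these metrics (with C classes and per-class true positives TP_c, false positives FP_c, and false negatives FN_c):

```latex
\mathrm{mPA}=\frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{TP}_c}{\mathrm{TP}_c+\mathrm{FN}_c},\qquad
\mathrm{F1}=\frac{2\cdot\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},\qquad
\mathrm{mIoU}=\frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{TP}_c}{\mathrm{TP}_c+\mathrm{FP}_c+\mathrm{FN}_c},\qquad
\kappa=\frac{p_o-p_e}{1-p_e},
```

where p_o is the observed agreement and p_e the chance agreement of the confusion matrix.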

C. Experimental Results
The experimental environment is the PyTorch framework on a 64-bit Ubuntu operating system, configured with an Intel(R) Core(TM) i7-7700K CPU @ 4.20 GHz and a GeForce GTX 1080 Ti GPU. For both the classification and segmentation tasks, we apply the basic cross-entropy loss function.

1) Image Classification: a) Effectiveness of DSViT module on the UC Merced dataset:
In order to confirm the feasibility of DSA_Block, we visualize the internal convolution output features and self-attention weights of each DSA_Block in the classification network, as shown in Fig. 4. A total of five DSA_Block layers are used in the classification network, and the first channel of each layer's feature map is used for visualization. Because of the downsampling process, the deeper the layer, the smaller the feature map, so to unify the visualization we resize each feature map to 256 × 256. The first row is the feature map after the two convolution operations in DSA_Block, corresponding to out_c in Fig. 1. The second row is the visualization of the self-attention weight in DSA_Block, corresponding to out_m in Fig. 1. The third row is the output of DSA_Block, corresponding to the output in Fig. 1. The image in Fig. 4 is a baseball diamond; the main features for identifying a baseball field are the fan-shaped sand and the circular sand at its center. In the visualization, the output of the convolution operation mainly captures boundary information such as shape and outline, while the self-attention weight focuses on the fan-shaped and circular areas; fusing the two yields features that carry both boundary information and area information. We perform classification experiments on the UC Merced dataset, comparing against some classic methods [1], [2], [33], [38], [39] and some recent CNN- and Transformer-based methods [18], [27], [42], [43]; the experimental results are presented in Table III. We make the comparison in the following three cases: 1) compared with traditional pure convolutional networks, the method in this article has a net increase of 15% and 17.93% in accuracy and mPA, respectively, over the classic AlexNet [1]; compared with the deep convolutional network DenseNet [40], the net increase in accuracy and mPA is 3.09% and 6.17%, respectively; and compared with the attention-based convolutional network EfficientNet [39], the net increase in accuracy and mPA is 11.91% and 13.57%, respectively; 2) compared with pure Transformer-based networks, the pioneering method ViT [18] achieves an accuracy and mPA of 87.89% and 83.63%, respectively, and the method in this article improves these two indicators by 8.3% and 9.81%; compared with PVT [42], it improves them by 5% and 4.82%, respectively; and 3) compared with frameworks integrating CNN and Transformer, the method in this article improves accuracy and mPA by 5.24% and 5.49%, respectively, over Swin [27], and by 12.57% and 13.85% over the recent ConvNeXt [43]. Through the above comparisons, the method in this article can feasibly combine the strengths of CNN and Transformer to achieve better performance than a pure CNN model or a pure Transformer architecture. At the same time, compared with some models that fuse the two, the dynamically scalable attention module proposed in this article can better fuse local and global information.
b) Acc and confusion matrix visualization on the UC Merced dataset: We evaluate the network after each training epoch. Fig. 5(a) shows the accuracy of several methods during the verification phase. The total number of training epochs is 400, and we plot the test accuracy after every ten iterations. We can observe from the figure that the convergence speed and trend of these architectures are almost the same, which indicates that the method in this article alleviates the instability of the Transformer; our method achieves the best performance after 100 iterations, especially once the network tends to converge. The method in this article still performs better than these pure CNN, pure Transformer, and fused CNN-Transformer architectures. In addition, we visualize the test results of the best model of the method in this article through the heat map of the confusion matrix, as shown in Fig. 5(b). The method in this article achieves 100% accuracy in 13 categories, only two categories have an accuracy lower than 90%, and the overall accuracy reaches 96%, which further demonstrates the effectiveness of the method in this article.
c) Effectiveness of DSViT module on the WHU-RS19 dataset: To demonstrate the generalization of the method in this article, we further conduct comparative experiments on the WHU-RS19 dataset, as shown in Table III. Similar to the experimental results on the UC Merced dataset, on the WHU-RS19 dataset we achieve a classification accuracy of 93.88% and an mPA of 90.63%. The method in this article achieves a larger performance improvement than the pure CNN models, the pure Transformer architectures, and the combined CNN-Transformer architectures, which further illustrates the effectiveness of the dynamically scalable attention module proposed in this article and shows that it generalizes well to other datasets.
d) Acc and confusion matrix visualization on the WHU-RS19 dataset: As for the UC Merced dataset, we also visualize the test accuracy curve and the confusion-matrix heat map on the WHU-RS19 dataset, as shown in Fig. 6. Fig. 6(a) shows that our method maintains the best performance compared with the other methods after 50 iterations until the end of training. In Fig. 6(b), the confusion matrix shows that the classification accuracy of this method is above 93%, but the River category is partly misclassified into the Forest and Park categories, so its accuracy is only 73%.
e) Efficiency analysis of DSA block and ViT block: We analyzed the computational cost and parameter count of the traditional ViT block and the DSA block proposed in this article. In the ViT model, the input of the attention module is the same at every layer, so the floating-point operations (FLOPs) and parameter count of each ViT block are the same. However, the input of the DSA block proposed in this article changes with the depth of the model, and the input of each DSA block at each layer varies. Therefore, we calculated the FLOPs and parameter count of each DSA block separately, and the results are shown in Table IV. It can be seen that the FLOPs and params of each DSA block are lower than those of each ViT block. As the depth of the model increases and the number of feature channels increases, the FLOPs and params of the DSA block gradually decrease but remain lower than those of the ViT block. This further demonstrates the effectiveness and superiority of the proposed model.
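As an aside on reproducing such a comparison, trainable parameters can be counted directly in PyTorch; the snippet below reuses the hypothetical DSABlock and layer_cfg from the earlier sketches and is only an illustration, not the accounting behind Table IV (FLOPs additionally require a profiling pass over a forward computation, which is omitted here).

```python
import torch

def count_params(module: torch.nn.Module) -> int:
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Compare one hypothetical DSA block per layer of the classification network.
for in_ch, out_ch, s in layer_cfg:
    block = DSABlock(in_ch, out_ch, reduction=s)
    print(f"DSA block {in_ch}->{out_ch} (s={s}): {count_params(block):,} params")
```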

2) Image Segmentation: a) Effectiveness of DSViT module on Potsdam dataset:
We further demonstrate the effectiveness of the dynamically scalable attention module proposed in this article on the remote sensing image semantic segmentation task. On the Potsdam dataset, we conduct experiments with the semantic segmentation network shown in Fig. 3, and the specific index values calculated for each category are shown in Table V. It can be seen that the method in this article recognizes the three ground-feature types Building, Car, and Impervious surfaces very well, with an accuracy of over 94%, while the recognition accuracy of Tree and Low vegetation reaches more than 86%. The visualization of each category in Fig. 7 also reflects this pattern.
Comparing the segmentation model in this article with other classic and efficient methods, as shown in Table VI (where "-" indicates that the corresponding result is not reported), compared with DeeplabV3 [44] the method in this article improves the accuracy and F1 by 1.24% and 1.76%, respectively. Compared with the pure CNN architecture dense dilated convolutions merging network (DDCM_Net) [26], the accuracy is increased by 1.34%. Compared with the densely connected swin transformer (DC-Swin) [45], which combines CNN and Transformer, the accuracy is increased by 0.14% and the F1 by 0.61%, but the method in this article is 0.24% lower on mIoU. Compared with FT-UNetFormer [46], this article improves the accuracy and F1 by 0.14% and 0.56%, respectively. Compared with the recent convolutional network mutual affine network (MANet) [47], our method also has advantages. Compared with the state-of-the-art methods [48], [49], [50], [51], the accuracy and F1 of our method are higher than those of all of them. The experimental results also show that the method in this article performs better on all indicators than the bi-similarity network (BSNet) [52] and transformers U-Net (TransUnet) [53]. Compared with the latest multiscale channel attention fusion network (MCAFNet) [54] and block-in-block edge detection network (BIBED-Seg) [55] methods, the accuracy is increased by 1.94% and 1.34%, respectively; our method is 11.32% higher than BIBED-Seg on mIoU and 4.06% higher than MCAFNet on F1. Based on the above analysis, the method in this article has better feature extraction ability and also shows excellent performance in the semantic segmentation task, which demonstrates that the dynamically scalable attention module proposed in this article can be applied to multiple tasks and has a very strong ability in feature extraction and learning. We predict and visualize the test images of the Potsdam dataset, as shown in Fig. 7. The method in this article can effectively extract detailed information such as boundaries and, at the same time, can identify and accurately classify classes with very few pixels. This shows that the segmentation framework proposed in this article can not only establish long-range dependencies (global information interaction) through dynamically scalable self-attention, but can also effectively capture local information through convolution operations, thereby achieving more accurate semantic segmentation.
b) Effectiveness of DSViT module on the LoveDA dataset: To further demonstrate the effectiveness and generalization of the method in this article, we verified it on a new remote sensing semantic segmentation dataset (LoveDA) and calculated the specific index values for each category. The results are shown in Table VII. It can be seen that the method in this article identifies the three ground-feature types Water, Agricultural, and Building very well, with an accuracy of more than 75%. The recognition accuracy of Barren and Road is about 70%. The segmentation of Forest is the worst, with an accuracy of only 56%. The visualization of each category in Fig. 8 also reflects this pattern.
We compare our method with other state-of-the-art methods, as shown in Table VIII. The official comparison metric for this dataset is only mIoU, so the compared works report no other index values, but the other related indexes are also calculated for our method in this article. The accuracy, F1, Kappa, and mIoU of our method are 73.32%, 68.44%, 59.17%, and 52.84%, respectively. Compared with the traditional deep convolutional network DeeplabV3 [44], this article improves mIoU by 7.7%. Compared with the official method HRNetw32 [37], it improves by 0.92%. Compared with DC-Swin [54], our method also improves mIoU, and compared with [49], [57], [58], it improves by 0.4%, 4.14%, and 7.58%, respectively. We also compare against the latest four methods, and the results show that the method in this article again achieves the best results, which further shows that it has great advantages in the remote sensing image segmentation task and can better combine global and local information to achieve accurate segmentation.
Similarly, we visualize the prediction results on the LoveDA data, as shown in Fig. 8. It can be observed that the method in this article can extract features from these complex ground objects and classify them, which further proves that the method in this article combines the strengths of CNN and Transformer and effectively integrates the global information captured by the Transformer with the local information carried by the CNN, so as to accurately segment different ground objects.

V. CONCLUSION
CNN and Transformer models have achieved remarkable results in the field of computer vision, and because their respective shortcomings and defects are complementary, the fusion of CNN and Transformer has become a new research direction. We have effectively fused CNN and Transformer to propose a dynamically scalable attention model that leverages the strengths of both. Our method has been validated on four public datasets, and the experimental results show that our approach achieves the best performance. We hope that DSViT will serve as a useful framework for future computer vision tasks. In the future, the remaining challenges are to reduce the complexity of the attention mechanism in the Transformer, increase the input sequence length, further improve the expressive ability of the Transformer, and propose a standard framework for convolutional network and Transformer interoperability.