Semantic Labeling of High Resolution Images Using EfficientUNets and Transformers

Semantic segmentation necessitates approaches that learn high-level characteristics while dealing with enormous amounts of data. Convolutional neural networks (CNNs) can learn unique and adaptive features to achieve this aim. However, due to the large size and high spatial resolution of remote sensing images, these networks cannot analyze an entire scene efficiently. Recently, deep transformers have proven their capability to record global interactions between different objects in the image. In this paper, we propose a new segmentation model that combines convolutional neural networks with transformers, and show that this mixture of local and global feature extraction techniques provides significant advantages in remote sensing segmentation. In addition, the proposed model includes two fusion layers that are designed to represent multi-modal inputs and output of the network efficiently. The input fusion layer extracts feature maps summarizing the relationship between image content and elevation maps (DSM). The output fusion layer uses a novel multi-task segmentation strategy where class labels are identified using class-specific feature extraction layers and loss functions. Finally, a fast-marching method is used to convert all unidentified class labels to their closest known neighbors. Our results demonstrate that the proposed methodology improves segmentation accuracy compared to state-of-the-art techniques.


I. INTRODUCTION
In recent years, with the continual advancement of remote sensing technology, high-resolution remote sensing satellites have been widely used, and the resolution of remote sensing images has considerably improved [1]. As a result, understanding detailed, high-resolution remote sensing images has become a significant challenge [2]. Semantic image segmentation, also known as pixel-level categorization, is a critical computer vision challenge and a vital technology for remote sensing image understanding [3]. In semantic segmentation, the goal is to assign a class label to each image pixel [4]. There are two types of image semantic segmentation methods: conventional and deep-learning-based [5]. Conventional machine learning approaches use handcrafted features, whereas deep learning approaches achieve higher performance by simultaneously learning the feature representation and the classifier parameters [3]. Deep learning methods, particularly convolutional neural networks (CNNs), have been used successfully to solve multiple remote sensing problems. For example, a CNN-based model with a downsample-then-upsample architecture was used by Volpi et al. for semantic labelling of subdecimeter resolution images [6]. An object-based classification technique was integrated with a deep learning model in [7] to improve remote sensing image classification accuracy. Zhao et al. used CNNs to explore semantic segments, and a conditional random field was utilized to model the contextual information between them [8]. A rotation-equivariant CNN architecture was used for high-resolution land cover mapping in [9]. Bergado et al. [10] presented a single-stage approach that embeds the processing stages in a recurrent multiresolution convolutional network. A self-cascaded network was used to improve labeling coherence using a sequential global-to-local context aggregation method [11]. Marmanis et al. [12] presented an end-to-end trainable deep convolutional neural network for semantic segmentation with built-in knowledge of semantically important boundaries. Sun et al. [13] suggested ensemble techniques and a residual architecture for encoder-decoder models to mitigate the negative effects of structural stereotypes and address the issue of insufficient learning. Mi et al. [3] used a differentiable decision forest for remote sensing semantic segmentation. Zhong et al. adapted the conventional FCN-8s/16s/32s models to extract roads and buildings from remote sensing RGB images [14]. Audebert et al. [15] proposed and evaluated FCN-based semantic segmentation approaches that use IRRG and DSM as input data. A residual dense U-Net was proposed for pixel-wise sea-land segmentation in complex and high-density remote sensing images [16]. A symmetrical dense-shortcut U-Net was used to segment high-resolution remote sensing images [17]. The DeepLab semantic segmentation model and object-based image analysis were used to segment high-resolution remote sensing images in [18].
Although the powerful representation capabilities of deep learning and CNN-based methods have aided in developing semantic segmentation of high-resolution images, the achieved results remain far from optimal. Convolutional segmentation models rely on learnable convolutions that extract semantically significant features. Unfortunately, the local scope of convolutional filters restricts access to the relationships between distant image pixel intensities. Global features are particularly critical in remote sensing segmentation because the labeling of image patches frequently depends on the global context. To circumvent this limitation, DeepLab models [18] use dilated convolutions and spatial pyramid pooling. This expands the receptive fields of convolutional networks and enables the extraction of multiscale information.
Nevertheless, convolutional backbones remain biased toward local interactions, and fundamental changes in network architecture are required to solve this issue [19]. Recently, transformer models have gained immense interest in solving computer vision tasks due to their effectiveness [20]. Transformers can capture global interactions between elements in a scene. However, modeling global interactions has a cost that grows quadratically with the number of elements, which makes such techniques computationally prohibitive when applied to raw image pixels. Despite the high performance shown by transformers in several computer vision problems, they have drawbacks that limit their capability. For example, transformers do not learn to attend locally in earlier layers, while incorporating local information at lower layers is vital for strong performance [21].
Here, we propose an end-to-end fusion framework that combines U-Nets and transformers. Fig. 1 describes the architecture of the proposed ensemble model. The U-Net component models the dense connections between pixels, while the transformer-based component models the context using a token-based technique. U-Net features emphasize local relationships between pixel intensities, while transformer features emphasize global interactions. The developed segmentation system consists of four major parts: an input fusion layer, a transformer-based network, an EfficientUNet network, and an output fusion layer. In the input fusion layer, features representing the multimodal input data (image content and elevation maps, DSM) are extracted. The fused features are processed in parallel by the U-Net model and the transformer model. Finally, the resulting feature maps are passed through a novel multi-task segmentation strategy that identifies class labels using class-specific feature extraction layers and loss functions. The use of these class-specific features and loss functions was shown to further improve the performance of the network.
The contributions of this paper can be summarized as follows.
• An efficient mixture model, EfficientUNetTransformer, is developed for the semantic segmentation of high-resolution remote sensing images. In this model, we combine transformers and the U-Net network to better represent global and local contexts, leading to more consistent labeling outcomes in complex urban constructions.
• A novel multi-task segmentation approach that identifies class labels using class-specific feature extraction layers and loss functions.
• Extensive experiments on two publicly available datasets demonstrate the performance of the proposed model. The proposed ensemble model yields higher accuracy than the purely convolutional equivalent and outperforms several recently proposed attention-based semantic segmentation algorithms.
The remainder of this paper is organized as follows. Section II describes the proposed model in detail. Section III presents experimental results and discussions, while Section IV summarizes the conclusions.

II. PROPOSED MODEL
A flowchart of the proposed model is presented in Fig. 1. The developed segmentation system consists of four major parts: 1) a transformer-based network that extracts global high-level semantic characteristics from the input image, 2) an EfficientUNet network that focuses on the extraction of local features from the input image, 3) an input fusion layer that merges the IRRG or RGB image with its digital surface model (DSM) image, and 4) an output fusion layer that splits the sum of local and global features into six separate binary subclasses by passing them through tokenizers and transformers. In what follows, we provide a detailed description of the different components of the system.

A. Input fusion layer
We propose a new input fusion layer that combines the IRRG or RGB image with the corresponding DSM image. This module passes both the IRRG/RGB and DSM images through a sequence of convolution, batch normalization (BN), and ReLU blocks to extract high-resolution features. A dot multiplication is then applied between the two obtained feature maps, followed by a BN layer. Finally, the resulting feature map is added to the original IRRG/RGB image. These details are illustrated in Fig. 2.
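The data flow of the input fusion layer can be illustrated with a minimal NumPy sketch. It is not the implementation used in this paper. It assumes that the convolutions are point-wise (the kernel size is not stated above), that the "dot multiplication" is an element-wise product, and that batch normalization can be approximated by a per-channel standardization without learnable parameters. The function names and the random weights are hypothetical.

```python
import numpy as np

def standardize(x, eps=1e-5):
    """Per-channel standardization: a stand-in for BN without learnable scale/shift."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv_bn_relu(x, weight):
    """Point-wise convolution -> normalization -> ReLU.

    x:      (C_in, H, W) input
    weight: (C_out, C_in) kernel of a 1x1 convolution (kernel size is an assumption)
    """
    y = np.einsum("oc,chw->ohw", weight, x)
    return np.maximum(standardize(y), 0.0)

def input_fusion(image, dsm, w_img, w_dsm):
    """Fuse an image (3, H, W) with its DSM (1, H, W) as described in the text."""
    f_img = conv_bn_relu(image, w_img)   # image branch
    f_dsm = conv_bn_relu(dsm, w_dsm)     # elevation branch
    fused = standardize(f_img * f_dsm)   # element-wise product followed by normalization
    return fused + image                 # residual addition of the original image

rng = np.random.default_rng(0)
image = rng.random((3, 256, 256))        # IRRG or RGB patch
dsm = rng.random((1, 256, 256))          # elevation patch
fused = input_fusion(image, dsm,
                     rng.standard_normal((3, 3)),
                     rng.standard_normal((3, 1)))
print(fused.shape)
```

Because the fused map is added back to the image, both branches must produce as many channels as the image has, which is why the two hypothetical kernels map to three output channels.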

B. Transformer path
An EfficientNetB7 [22] deep neural network architecture is used to extract relevant features from image patches. First, we removed the last stage (i.e., the head stage) from the original EfficientNetB7, which originally contained nine stages. Next, the extracted features are tokenized and sent to a transformer encoder. The tokenizer (Fig. 1) takes the extracted features $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the height, width, and channel dimension of the input features, and divides them into $N_f$ semantic tokens, where $N_f$ is the size of the vocabulary set of tokens. We use a point-wise convolution across the channel dimension to produce $N_f$ semantic groups for each pixel of the EfficientNet features, with each group denoting one semantic idea (Fig. 3). A softmax function is applied to the $HW$ dimension of each semantic group to compute the feature maps $F$. Similarly, another point-wise convolution is used to produce $N_a$ semantic groups for each pixel of the EfficientNet features, with each group denoting one semantic idea (Fig. 3). We set $N_a$ equal to the number of segmentation labels. A softmax function is applied to the $HW$ dimension of each semantic group to compute the attention layers $A$.
The semantic tokens $T_j$, $j = 1, \dots, N_f$ (with $N_f = 32$), are computed from the feature maps $F = \sigma(\phi(X; W_2))$ and the attention layers $A = \sigma(\phi(X; W_1))$, where $\phi(\cdot)$ is a point-wise convolution with learnable kernels $W_1 \in \mathbb{R}^{1 \times 1 \times N_a}$ and $W_2 \in \mathbb{R}^{1 \times 1 \times N_f}$, and $\sigma(\cdot)$ is the softmax function. Each token $T_j$ is of size $HW \times 6$. The EfficientNet feature maps are of size $65 \times 65 \times 32$. The values $N_a = 6$ and $N_f = 32$ are used in the experiments of this paper.
The transformer encoder [23] is composed of encoders that translate a series of patch embeddings to pixel-level class labels. It has $L_D = 6$ layers of multi-head self-attention (MSA) and feedforward (FF) blocks (see Fig. 4). At each layer $l$, the input to self-attention is a triple (query $Q$, key $K$, value $V$) computed from the input $T^{(l-1)}$. Unlike the original transformer, which uses the post-norm residual unit, we apply layer normalization immediately before the MSA/MLP blocks. The MSA unit can be described as:
$$\mathrm{MSA}(q, k, v) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h), \qquad \mathrm{head}_j = \mathrm{Attention}(q W^{q}_j, k W^{k}_j, v W^{v}_j)$$
where $W^{q}_j$, $W^{k}_j$, and $W^{v}_j$ are the linear projection matrices, and $h$ is the number of attention heads. The multi-head attention block receives three components (the query $Q$, the key $K$, and the value $V$) to compute the self-attention output
$$\mathrm{Attention}(Q, K, V) = \sigma\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$$
where $d$ represents the channel dimension of the three components and $\sigma$ is the softmax function applied to the channel dimension. Finally, the output of the transformer is upsampled to match the dimension of the features extracted from the EfficientUNet network.
The transformer decoder comprises $L_D = 6$ layers of multi-head cross attention (MCA) and FF blocks. The encoder's patch-level encodings are mapped to patch-level class scores by the decoder. The decoder and encoder configurations are equivalent. The MCA receives the query from the extracted features $X$, and the key and the value from the tokens $T$ generated by the transformer encoder (Fig. 4).
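The two building blocks described above, the spatial softmax that produces the semantic groups and the attention operation, can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the code used in the experiments: the kernels are random, the helper names are hypothetical, the attention function follows the standard scaled dot-product formulation with the softmax taken over the keys, and the token matrix passed to it is a random placeholder because the rule for combining $F$ and $A$ into tokens is not reproduced here.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_groups(features, kernel):
    """Point-wise convolution followed by a softmax over the HW dimension.

    features: (H, W, C) feature maps
    kernel:   (C, N) weights of a 1x1 convolution
    returns:  (H*W, N), where every column sums to 1 over the pixels
    """
    h, w, c = features.shape
    logits = features.reshape(h * w, c) @ kernel
    return softmax(logits, axis=0)

def attention(q, k, v):
    """Scaled dot-product attention (standard formulation, assumed here)."""
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((65, 65, 32))                    # EfficientNet features
F = semantic_groups(X, rng.standard_normal((32, 32)))    # N_f = 32 groups
A = semantic_groups(X, rng.standard_normal((32, 6)))     # N_a = 6 groups

tokens = rng.standard_normal((32, 64))                   # placeholder tokens, hypothetical size
out, weights = attention(tokens, tokens, tokens)         # self-attention over the tokens
print(F.shape, A.shape, out.shape)
```

Attending over 32 tokens instead of the 4225 pixel positions of a 65 × 65 feature map is what keeps the quadratic cost of attention manageable.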

C. EfficientUNet
The second part of the proposed model is a U-Net segmentation model that uses an architecture from the EfficientNet family of networks as a backbone. EfficientNet image classification models apply a compound-scaling approach that consistently adjusts the network depth, width, and resolution for increased performance using a given set of scaling parameters [22]. Scaling the network incrementally increases model performance by balancing the compound coefficients for the architecture's width, depth, and image resolution. EfficientNet is built using mobile inverted bottleneck convolution (MBConv) blocks, as shown in Fig. 5b. The proposed model uses the swish activation function [24] instead of the widely used rectified linear units (ReLUs). When going from EfficientNetB0 to EfficientNetB7, the depth, width, resolution, and model size increase, while the accuracy improves [22]. The architecture used in the proposed model is EfficientNetB7, which has 55 basic MBConv building blocks, as shown in Fig. 5a. The components used in these blocks are shown in Fig. 5b. The proposed U-Net encoder is based on the EfficientNetB7 model. The decoder is constructed using a reversed version of the encoder model with upsampling units. The encoder outputs of layers 3, 10, 17, and 27 are concatenated with their corresponding decoder outputs, as shown in Fig. 1. Transposed convolution layers were employed to build the decoder, each of which doubles the size of a feature map while halving the number of channels. An upsampling layer followed by Double convolution layers was applied after each concatenation operation. A Double convolution layer applies the sequence of convolution, batch normalization, and ReLU operations twice.
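The effect of the transposed convolutions on the feature-map shape can be traced with a few lines of Python. The output-size formula is the usual one for a transposed convolution without dilation. The kernel size of 2 and stride of 2 are assumptions that yield an exact doubling, and the starting size and channel count in the example are hypothetical.

```python
def transposed_conv_size(size, kernel=2, stride=2, padding=0, output_padding=0):
    """Spatial output size of a transposed convolution (dilation of 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

def decoder_shapes(size, channels, steps):
    """Trace (spatial size, channels) through `steps` decoder stages.

    Each stage doubles the spatial size and halves the channel count,
    as stated in the text for the transposed convolution layers.
    """
    shapes = [(size, channels)]
    for _ in range(steps):
        size = transposed_conv_size(size)
        channels //= 2
        shapes.append((size, channels))
    return shapes

# Hypothetical bottleneck: an 8 x 8 map with 640 channels, upsampled four times.
print(decoder_shapes(8, 640, 4))
```

The trace ignores the channels added by each skip concatenation, which the Double convolution layers reduce again in a U-Net decoder.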

D. Output Fusion layer
This module is composed of a CNN that receives as input the sum of the final feature maps from the two deep networks: the transformer and EfficientUNet. The input feature maps are summed and fed into a shallow CNN, which is a Double convolution layer with 32 input channels and 6 output channels. The CNN's output is sent to six separate tokenizers and transformer encoders, representing the six binary classes, and then passed through logarithmic softmax functions (Fig. 6). The logistic loss is used as the loss function; it is computed using a logarithmic softmax layer and averaged across the entire patch [15]:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} y_i^{(j)} \log \hat{y}_i^{(j)}$$
where $N$ is the number of pixels in the input image, $k$ equals 2 classes, and for each pixel $i$, $y_i$ and $\hat{y}_i$ are the true and predicted labels, respectively, with the superscript $(j)$ denoting their component for class $j$. The different parts of the network were trained together to find the best model that can predict the right labels without using special pre- or post-processing operations. The semantic segmentation maps are obtained by combining the six binary classes using the Add&Inpaint layer (Fig. 6). This layer includes all classes computed in the output fusion layer in the output segmentation map and then replaces unclassified pixels by their nearest classified neighbor using a fast-marching method (FMM) [25].

III. EXPERIMENTAL RESULTS AND DISCUSSIONS
a) ISPRS Vaihingen Challenge Dataset: The first dataset used in this study belongs to the ISPRS 2D Semantic Labeling Challenge in Vaihingen [26]. It consists of three-band infrared, red, and green (IRRG) image data, digital surface model (DSM) data, and normalized digital surface model (NDSM) data [27]. In all, there are 33 images with a ground sampling distance of 9 cm. We divided the 16 images with available ground truth into 12 images for training and 4 images ("5", "21", "15", and "30") for validation. We used all 16 images as the training set when evaluating on the test set.
b) ISPRS Potsdam Challenge Dataset: The second dataset used in this study belongs to the Potsdam ISPRS 2D Semantic Labeling Challenge [26]. It is made up of four-band infrared, red, green, and blue (IRRGB) image data and matching DSM and NDSM data. Of the 38 images of 5-cm resolution, 24 images had the ground truth available, while the remaining 14 were kept by the challenge organizer for testing. From the 24 images given by the challenge organizer, we selected 17 images for training and 7 images ("3_11", "3_12", "4_11", "5_10", "6_9", "6_12", "7_11") for validation. We used all 24 images as the training set when evaluating on the test set. In both datasets, patches of size 256 × 256 pixels were extracted from the images using a sliding window with a stride of 32.
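The sliding-window extraction described above can be sketched as follows. The handling of image borders is not specified in the text, so this sketch assumes that only windows lying entirely inside the image are kept. The function names are hypothetical.

```python
def window_origins(length, patch=256, stride=32):
    """Start coordinates of all windows that fit entirely along one axis."""
    if length < patch:
        return []
    return list(range(0, length - patch + 1, stride))

def sliding_windows(height, width, patch=256, stride=32):
    """Yield (top, left, bottom, right) boxes of patch x patch windows."""
    for top in window_origins(height, patch, stride):
        for left in window_origins(width, patch, stride):
            yield top, left, top + patch, left + patch

# A 512 x 512 image gives 9 start positions per axis (0, 32, ..., 256).
boxes = list(sliding_windows(512, 512))
print(len(boxes))  # 81
```

With a stride of 32 and a patch size of 256, neighboring patches overlap by 224 pixels, so most pixels appear in many training patches.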
The proposed system was implemented in PyTorch. All of our models were trained using stochastic gradient descent (SGD) with a base learning rate of 0.01, momentum of 0.9, weight decay of 0.0005, and batch size of 10. The encoder-decoder weights were randomly initialized. We divided the learning rate by 10 after 25 and 45 epochs (out of a total of 100 epochs used for training). We report the overall pixel-wise accuracy (OA), the average F1 score, and Cohen's kappa coefficient κ across all classes to quantitatively evaluate performance. The F1 score and kappa coefficient κ for a class $i$ are defined as follows:
$$F1_i = 2\,\frac{\mathrm{precision}_i \cdot \mathrm{recall}_i}{\mathrm{precision}_i + \mathrm{recall}_i}, \qquad \mathrm{precision}_i = \frac{tp_i}{P_i}, \qquad \mathrm{recall}_i = \frac{tp_i}{C_i}$$
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $tp_i$ is the number of true positives for class $i$, $C_i$ is the number of pixels in class $i$, $P_i$ is the number of pixels assigned to class $i$ by the model, $p_o$ is the relative observed agreement among raters, and $p_e$ is the hypothetical probability of chance agreement. In compliance with the competition organizers' assessment guidelines, these metrics were computed after eroding the class boundaries with a circle of three-pixel radius and discarding those pixels [15].
Performance comparison: We compared the performance of the proposed model against several recently proposed models using the validation and testing datasets.
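The evaluation metrics named above (OA, per-class F1, and Cohen's kappa) can be computed from a confusion matrix, as in the following sketch. The function name and the example matrix are hypothetical, and the boundary erosion required by the challenge guidelines is not included.

```python
def segmentation_metrics(confusion):
    """Compute OA, per-class F1, mean F1, and Cohen's kappa.

    confusion[t][p] is the number of pixels of true class t predicted as class p.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    true_counts = [sum(row) for row in confusion]                                # C_i
    pred_counts = [sum(confusion[t][p] for t in range(n)) for p in range(n)]     # P_i

    f1 = []
    for i in range(n):
        tp = confusion[i][i]
        denom = true_counts[i] + pred_counts[i]
        # 2 * precision * recall / (precision + recall) simplifies to 2 * tp / (P_i + C_i)
        f1.append(2 * tp / denom if denom else 0.0)

    p_o = sum(confusion[i][i] for i in range(n)) / total                         # observed agreement (OA)
    p_e = sum(c * p for c, p in zip(true_counts, pred_counts)) / total ** 2      # chance agreement
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {"oa": p_o, "f1": f1, "mean_f1": sum(f1) / n, "kappa": kappa}

# Hypothetical two-class example with 100 pixels.
metrics = segmentation_metrics([[40, 10], [5, 45]])
print(metrics["oa"], metrics["kappa"])
```

For this example matrix, the observed agreement is 0.85 and the chance agreement is 0.5, which gives a kappa of 0.7.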
The results obtained for the Vaihingen challenge dataset are presented in Fig. 7 and Table I, and those for the Potsdam challenge dataset are given in Fig. 8 and Table II. One can see in Fig. 8 and Table I that the UNet, FCN-8s, and PSPNet models achieve low-quality results compared to the SegNet, Transformer, and the proposed model. For example, all convolutional models (i.e., UNet, FCN-8s, PSPNet, and SegNet) show low performance in car segmentation; however, the transformer-based models (i.e. transformer alone or our fusion models) classify that category accurately (improvement by more than 3%). This result demonstrates the superior ability of transformers in representing dynamically changing object classes. The late fusion between the EffUNet and the transformer improved the kappa and overall accuracy by around 1% compared to the transformer alone or SegNet. We evaluate the model without the semantic tokenizer to see its influence on prediction quality. The results demonstrate its effectiveness, especially in labeling objects from the low vegetation and cars classes, in which we observe a difference of about 3% in local accuracy (see Table I).
On the Potsdam dataset, the transformer model alone improved the results by more than 2% in kappa and by around 1% in overall accuracy compared to the SegNet model, which achieved the closest results. The fusion adds 1% to the overall accuracy. The transformer, both alone and fused with EffUNet, improves the classification quality of the tree class by around 4% compared to conventional models such as SegNet or UNet. To demonstrate the ability of the proposed model to deal with unclear objects, we selected a sample that contained such objects (Fig. 9). The proposed model predicts these objects well compared to the other models.
ISPRS benchmark dataset: The proposed model was tested using the testing dataset available on the website of the ISPRS challenge [26]. The performance of the model was compared with the following participants in the challenge:
1) SegNet + DSM + NDSM (ONE_7) [15]: The authors used the late fusion of two trained SegNets using the IRRG image and the composite image that contained NDVI, DSM, and NDSM.
2) Self-cascaded + ResNet (CASIA2) [11]: A single self-cascaded network with the encoder based on a variant of a 101-layer ResNet [38]. The authors used only the 3-band IRRG images to predict the segments, which makes their model computationally more tractable.
3) CNN + HCF + CRF (ADL_3) [33]: The model used a CNN to extract the image features to produce per-pixel category probabilities; then, a conditional random field (CRF) was applied as a post-processing step to find the predicted labels.
: IRRG and DSM data were used as inputs to the combined model based on the FCN trained with no downsampling and random forest to find the output features; next, the CRF is used as a post-processing step.
6) Gated segmentation network (GSN3) [35]: A gated segmentation network was proposed. ResNet-101 was used as the feature extractor in the encoder portion, and the entropy control module was used for feature fusion in the decoder. A residual convolution module (RCM) was employed as the basic processing unit.
7) CNN + NDSM + Deconvolution (UZ_1) [36]: The model comprises a CNN trained to learn a series of downsampling blocks (a regular CNN) and a sequence of nonlinear upsampling blocks that use deconvolutions to return to the original input size.
8) Dilated ConvNet (UFMG_4) [6]: The authors proposed a series of dilated convolutions [50]. The primary concept is to train a dilated network with different patch sizes to collect multi-context features from diverse contexts.
9) SegNet + FCN (RIT_7) [37]: In their model, SegNet was fused with an FCN for pixel-wise semantic classification.
10) LANet [38]: The authors proposed the local attention network to improve the semantic segmentation of RSIs by enhancing the scene-related representation in both encoding and decoding phases.
11) Swin-B-CNN + BD [39]: Swin transformers and a CNN were used as encoder and decoder for remote sensing segmentation. The CNN is applied to recover the size of the feature maps and acquire the semantic segmentation results.
12) ResNeSt [40]: An Attention-Residual block-Embedded Adversarial Network was investigated to learn local-to-global contextual information through improved collection of semantic and position information.
13) MF-DFNet [41]: A multiscale feature and discriminative feature network was proposed to resolve the issues of intraclass inconsistency and the difficulty in locating and identifying the target.
14) DGCR [42]: A dynamic graph contextual reasoning module over global reasoning networks was presented for capturing long-range dependencies in feature representations.
15) Swin-S [43]: The Swin transformer and the densely connected feature aggregation module were used as encoder and decoder, respectively, to improve the remote sensing semantic segmentation accuracy.
16) CEGFNet [44]: An end-to-end common extraction and gate fusion network was proposed to solve the problem of misclassification of small objects.
17) G2GNet [45]: To calibrate the RGB responses for improved feature representation, the informative features from the RGB and auxiliary data were adaptively gathered using a self-adaptive attention mechanism.
18) AFNet [46]: Multiscale and multilevel feature maps based on a CNN were combined for remote sensing semantic segmentation.
19) SBANet [47]: To extract full and crisp borders from complicated very-high-resolution remote sensing images, a semantic boundary awareness network was developed.
20) HUSTW3 [48]: The authors developed a residual architecture for encoder-decoder models to address the issues of inadequate learning and receptive field imbalance faced by encoder-decoder models.
Table III shows that the proposed fusion model outperforms existing methods on the Vaihingen dataset, with a testing accuracy of 91.5% when using IRRG and 91.8% when combining IRRG and DSM. On the Potsdam dataset, the proposed approach achieved the highest accuracy: 91.8% when using only RGB images and 92.9% when combining RGB with DSM (Table IV). Elevation data increased the accuracy only slightly, which makes the RGB image alone preferable for use in systems with computational constraints.

IV. CONCLUSION
In this paper, we proposed a novel fusion deep learning model for the semantic labeling of multi-modal ultra-high-resolution urban remote sensing data. We showed that the fusion of deep transformers and convolutional neural networks (i.e., the U-Net model) is an effective method for recognizing the relationships between objects and scenes, leading to consistent labeling outcomes for complex urban objects.
Extensive experiments on two publicly available challenging datasets demonstrate the proposed model's efficacy and efficiency. The proposed model was shown to be more consistent and to yield more accurate labeling outcomes than existing frameworks.