Deep hierarchical transformer for change detection in high-resolution remote sensing images

ABSTRACT Deep learning instantiated by convolutional neural networks has achieved great success in high-resolution remote-sensing image change detection. However, such networks have a limited receptive field, being unable to extract long-range dependencies in a scene. As the transformer model with self-attention can better describe long-range dependencies, we introduce a hierarchical transformer model to improve the precision of change detection in high-resolution remote sensing images. First, the hierarchical transformer extracts abstract features from multitemporal remote sensing images. To effectively minimize the model’s complexity and enhance the feature representation, we limit the self-attention calculation of each transformer layer to local windows with different sizes. Then, we combine the features extracted by the hierarchical transformer and input them into a nested U-Net to obtain the change detection results. Furthermore, a simple but effective model fusion strategy is adopted to improve the change detection accuracy. Extensive experiments are carried out on two large-scale data sets for change detection, LEVIR-CD and SYSU-CD. The quantitative and qualitative experimental results suggest that the proposed method outperforms the advanced methods in terms of detection performance.


Introduction
Change detection in high-resolution remote sensing images has a wide range of applications in tasks such as land use surveys, geographic spatial database updating, and disaster monitoring and assessment, being a main research topic in remote-sensing image processing and analysis (W. Shi et al., 2020). For matched bitemporal remote sensing images, change detection aims to distinguish variations in corresponding pixel pairs. Usually, the areas that have and have not changed are labeled with 1 and 0, respectively (Ke & Zhang, 2021).
Supervised change detection methods can precisely specify the change areas of interest via annotated samples and generally achieve good detection performance. Therefore, they have been widely studied. From a technical perspective, these methods can be divided into traditional machine-learning methods and deep learning-based methods (He et al., 2020). Traditional machine-learning methods rely on handcrafted features, whereas deep learning-based methods rely on deep features. Traditional machine-learning methods represented by support vector machines and random forests were widely used for change detection in remote sensing images (Bovolo et al., 2008; Feng et al., 2021; Negri et al., 2021). However, owing to differences in the imaging conditions of multitemporal remote sensing images, it is difficult to distinguish the changes from the background. Consequently, directly applying traditional machine-learning methods generally yields low detection performance. Moreover, traditional machine-learning methods often require complex handcrafted feature extraction to ensure high-performance change detection. Typical extracted features include texture features, statistical features, and spatial structure features (Bai et al., 2014; Celik, 2009; Wu et al., 2014, 2017; Z. Li et al., 2017). Handcrafted features rely on expert knowledge and require setting many hyperparameters according to the data characteristics to ensure high detection performance.
In recent years, with continuous improvements in computing power and data acquisition, deep learning-based methods have greatly improved change detection accuracy. Therefore, deep learning has received extensive attention for change detection in remote sensing images (A. Zhang et al., 2020; L. Zhang et al., 2016). Unlike traditional machine-learning methods, deep learning models can automatically learn to extract abstract features for downstream tasks and obtain higher detection accuracy provided that sufficient labeled training data are available.
Deep learning has made great progress in change detection. Most existing methods use a CNN as the backbone for feature extraction. However, the receptive field of a CNN is limited, so these models cannot perceive a wide range of contextual information. In contrast, the self-attention used in the visual transformer provides a wider receptive field. Nevertheless, the original transformer cannot balance local and global feature information well and has high computational complexity. In this paper, we propose a change detection method based on a deep hierarchical transformer. To cope with the deficiencies of the original transformer, we limit the self-attention calculation of each transformer layer to local windows with different sizes. To further improve the detection accuracy, we embed features of different scales into a nested U-Net and design a model fusion method. The main contributions of this study can be summarized as follows:
(1) We introduce a hierarchical transformer to better extract features from remote-sensing images. We limit the self-attention calculation to a local window, which makes the model attend to local features. At the same time, we use shifted windows to exchange information between different local windows, so that the model can also attend to global features. In this way, the encoder considers both local and global information, thereby improving the change detection accuracy.
(2) To fully use the features extracted at different scales, we concatenate features of different scales and input them into a nested U-Net to complete change detection. Through the convolution layers of the nested U-Net, the extracted feature information can interact. More importantly, the nested structure allows features of different scales to be used directly as the input of the decoder. Therefore, in the nested U-Net, information can be exchanged between features at different scales, which lets the model better leverage the features for accurate change detection.
(3) We propose a simple and effective model fusion strategy to improve the final change detection results by fusing the outcomes from augmented image pairs.
Experimental results on two public datasets for change detection show that the proposed method can improve change detection, outperforming similar methods.
The remainder of this paper is organized as follows: In Section 2, we introduce the related work. In Section 3, we introduce the proposed method. In Section 4, we report change detection experimental results and the corresponding analyses. Finally, we draw conclusions in Section 5.

Related work
In the following, we give a brief overview of deep learning-based change detection methods. The two main families are CNN-based methods and transformer-based methods.

CNN based method
Among deep learning models, CNNs are well suited to extracting deep features from images. Therefore, deep learning-based change detection methods are mainly built on CNNs. However, the change detection task takes multi-temporal remote sensing images as input, so existing CNNs from computer vision cannot be used directly. There are two main approaches for applying deep learning to change detection. One approach is stacking bi-temporal remote sensing images and inputting them into a fully convolutional network to obtain the change detection results. Classic fully convolutional network models include FCN (Shelhamer et al., 2016), U-Net (Ronneberger et al., 2015), the DeepLab series (L. C. Chen et al., 2017), PSPNet (Zhao et al., 2017), and UpperNet (Xiao et al., 2018). The other approach is using a siamese network structure to handle the dual-path inputs for change detection (B. Liu et al., 2021; H. Chen et al., 2020; Khelifi & Mignotte, 2020; Lee et al., 2021; Lv et al., 2020; M. Zhang et al., 2019). For example, a bilateral semantic fusion siamese network was designed for change detection to better map bi-temporal images into the semantic feature domain for comparison (Du et al., 2022).
The attention mechanism can improve deep learning performance by discarding irrelevant information and emphasizing areas that are important to the task. Therefore, researchers have explored various attention mechanisms for change detection. For instance, H. Chen and Shi (2020) proposed STANet, which is based on spatiotemporal attention, to improve the change detection accuracy. They also constructed the publicly available LEVIR-CD large-scale change detection dataset. Similarly, DASNet adopts a dual attention mechanism to improve the change detection accuracy (J. Chen et al., 2021). H. Cheng et al. (2021) proposed a hierarchical self-attention augmented Laplacian pyramid-expanding network for highly accurate change detection. In addition, models such as high-resolution networks (Hou et al., 2021), recurrent convolutional neural networks (RNNs) (Mou et al., 2019; Sun et al., 2020), and generative adversarial networks (Peng et al., 2021) have also been used in change detection to better identify variations.

Transformer based method
CNNs usually have limitations in modeling global dependencies due to the intrinsic locality of convolution operations. The transformer has recently emerged as an alternative architecture for dense prediction tasks owing to the global dependency modeling ability brought by self-attention (Yuan et al., 2022). The change detection task requires global dependency modeling to improve the detection accuracy. Therefore, researchers have explored transformer-based methods for change detection. For example, a transformer encoder-decoder network named BIT was designed to enhance the context information extracted by a CNN and expand the receptive field (H. Chen et al., 2022). ChangeFormer unifies a hierarchically structured transformer encoder with a multi-layer perceptron (MLP) decoder in a siamese network architecture to efficiently obtain accurate change detection results (Bandara & Patel, 2022). TransUNetCD is an end-to-end encoding-decoding hybrid transformer model for change detection (Q. Li et al., 2022). It encodes the tokenized image patches from the convolutional neural network feature map to extract rich global context information. Therefore, TransUNetCD has the advantages of both transformers and CNNs. SwinSUNet is a pure transformer network for change detection (C. Zhang et al., 2022).

Proposed change detection method
Figure 1 shows a diagram describing the proposed change detection method. First, we input bi-temporal remote sensing images into a deep hierarchical transformer to extract abstract features. Then, we concatenate the features extracted from the four stages of the feature extraction backbone network and input them into the four stages of a nested U-Net, which finally provides the change detection results. In this section, we detail the transformer, nested U-Net, model training, and model fusion.

Deep hierarchical transformer
In tasks such as image classification, object detection, and semantic segmentation, the quality of the backbone network used for feature extraction greatly affects the model performance (A. Zhang et al., 2020; B. Liu et al., 2018, 2019, 2021). Therefore, various feature extraction backbones with excellent performance have been proposed based on CNNs for diverse computer vision tasks. Such CNNs include VGG, ResNet, DenseNet, and HRNet. Using these backbones in change detection may greatly improve its accuracy. However, CNNs have limited receptive fields, impeding the full use of global context information. To address this limitation, we use a transformer as the feature extractor for change detection.
The transformer architecture was originally intended for natural language processing, and its core is the self-attention mechanism, which can describe long-range dependencies. We use the transformer to capture global information from images. However, directly applying a global-attention-based transformer to images notably increases the computational complexity. Therefore, we introduce the hierarchical transformer shown in Figure 2 as the feature extraction backbone of the proposed change detection method (Z. Liu et al., 2021). The feature extraction backbone can be divided into four stages. Stage 1 begins with patch partition, which divides an image into non-overlapping patches of 4 × 4 pixels. As a remote-sensing image contains three bands, each flattened patch is a one-dimensional feature vector of dimension 4 × 4 × 3 = 48. Then, a linear embedding projects the feature vector from 48 dimensions to C dimensions. The transformed feature vectors can be regarded as a sequence, which is input into a transformer block to extract semantic features. Stage 1 comprises two transformer blocks and maintains the dimension (C) and number (H/4 × W/4, where H and W are the height and width of the input image) of the feature vectors. To extract hierarchical features, after stage 1, patch merging is used to aggregate features and reduce the number of feature vectors. Specifically, adjacent 2 × 2 C-dimensional feature vectors are merged into one 4C-dimensional feature vector, reducing the number of feature vectors to H/8 × W/8. Then, a linear layer reduces the feature vector from 4C to 2C dimensions, and the feature vector sequence is input into the two transformer blocks of stage 2. The output passes through a patch merging layer and then through the six transformer blocks in stage 3. Similarly, the output of stage 3 passes through a patch merging layer and then through the two transformer blocks in stage 4.
The four stages provide a hierarchical feature representation, resembling the hierarchical structure of CNNs that expand the receptive field as the network deepens. Hence, the transformer exploits global and local information.
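The token counts and channel widths described above can be traced with a short calculation. The following is a minimal sketch rather than the authors' implementation: the function name `stage_shapes` is hypothetical, and the values C = 96 and a 512 × 512 input are assumptions chosen only for illustration.

```python
def stage_shapes(height, width, embed_dim, patch_size=4, num_stages=4):
    """Return (tokens_high, tokens_wide, channels) for each backbone stage."""
    shapes = []
    h, w, c = height // patch_size, width // patch_size, embed_dim
    for _ in range(num_stages):
        shapes.append((h, w, c))
        # Patch merging before the next stage: 2 x 2 neighbouring vectors are
        # merged into one 4C vector, then a linear layer reduces it to 2C.
        h, w, c = h // 2, w // 2, c * 2
    return shapes


# A flattened patch of 4 x 4 pixels with three bands has 48 values.
patch_vector_dim = 4 * 4 * 3
print("patch vector dimension:", patch_vector_dim)

# embed_dim=96 and the 512 x 512 input size are illustrative assumptions.
for stage, shape in enumerate(stage_shapes(512, 512, embed_dim=96), start=1):
    print("stage", stage, shape)
```

Each stage halves the spatial resolution of the token grid and doubles the channel width, which is what gives the backbone its CNN-like feature pyramid.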
The original transformer adopts the global self-attention mechanism shown in Figure 3(a), which is formulated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

where Q, K, and V are the query matrix, key matrix, and value matrix, respectively, of dimension d. Thus, for each image block, self-attention is computed between its feature vector and the feature vectors of all other image blocks. To extract more representative features, the transformer also adopts a multi-head self-attention mechanism that initializes multiple sets of Q, K, and V. In this study, we used eight sets. To reduce the high computational complexity of the global attention mechanism, as illustrated in Figure 3(b), we adopt window multi-head self-attention (W-MSA) and constrain the self-attention calculation of each transformer block to a local window. Although this strategy reduces the model complexity, the features in different windows cannot communicate with each other, reducing the transformer's ability to describe long-range dependencies. Therefore, we also adopt shifted-window multi-head self-attention (SW-MSA), as illustrated in Figure 3(c), which shifts the windows toward the lower-right corner and thereby enables information interaction between different windows.
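To make the difference between global and window-restricted self-attention concrete, the sketch below computes scaled dot-product attention on a toy token sequence. It is a simplified illustration under stated assumptions: the windows are one-dimensional, there is a single head, and the learned projections that produce Q, K, and V are omitted (the tokens are used directly). The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V for lists of d-dimensional vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def window_attention(tokens, window):
    """Apply self-attention independently inside consecutive windows."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        out.extend(attention(chunk, chunk, chunk))
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 5.0]]
global_out = attention(tokens, tokens, tokens)   # every token sees all tokens
local_out = window_attention(tokens, window=2)   # tokens see only their window
print(global_out[0])
print(local_out[0])
```

With global attention, the first token is pulled toward the two large tokens at the end of the sequence; with windows of size two, it only mixes with its immediate neighbour. Shifting the window boundaries in alternate blocks is what lets information cross between windows.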
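The transformer block shown in Figure 2 can be formulated as follows:

$$\hat{z}^{l} = \text{W-MSA}\left(\mathrm{LN}\left(z^{l-1}\right)\right) + z^{l-1}$$
$$z^{l} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \text{SW-MSA}\left(\mathrm{LN}\left(z^{l}\right)\right) + z^{l}$$
$$z^{l+1} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}$$

where $\hat{z}^{l}$ and $z^{l}$ denote the outputs of the (S)W-MSA module and the MLP module of block $l$, respectively, MLP is a fully connected layer based on a multilayer perceptron that enhances the description of nonlinearity in the data, and LN represents layer normalization applied along the channel dimension.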

Nested U-Net
After the deep hierarchical transformer extracts features from bi-temporal remote-sensing images, the feature maps of the two images obtained from the four stages of the feature extraction backbone are concatenated along the channel dimension. Then, to fully use the feature information among different layers and improve the change detection accuracy, we input the concatenated features into the four stages of the nested U-Net shown in Figure 4 and obtain the change detection results.
The main difference between the nested U-Net and the original U-Net is the dense connection between the encoders and decoders. This dense connection allows comprehensive use of feature information at different scales and thereby improves the change detection accuracy.
Let $x^{i,j}$ be the feature map output by node $X^{i,j}$, where $i$ is the index corresponding to the downsampling layer along the encoder dimension, and $j$ is the index along the skip connection dimension. The feature map $x^{i,j}$ output by any node is obtained as follows:

$$x^{i,j} = \begin{cases} h\left(x^{i-1,j}\right), & j = 0 \\ h\left(\left[\left[x^{i,k}\right]_{k=0}^{j-1}, \varphi\left(x^{i+1,j-1}\right)\right]\right), & j > 0 \end{cases}$$

where $h(\cdot)$ represents applying convolution and ReLU (rectified linear unit) activation to a feature, $\varphi(\cdot)$ represents upsampling, and $[\cdot]$ represents the concatenation of features. A node for $j = 0$ accepts only one input from the previous layer of the encoder subnetwork, while a node for $j = 1$ accepts inputs from two consecutive layers of the encoder subnetwork. Nodes for $j > 1$ accept $j + 1$ inputs, including $j$ outputs from the first $j$ nodes on the current skip connection path and one upsampled output from the next skip connection path.
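The connectivity rule described above can be checked by enumerating the inputs of a node. This is a minimal sketch of the wiring only (no convolutions or upsampling are performed); the function name `node_inputs` is hypothetical, and the backbone features that the encoder nodes additionally receive in the proposed method are not modeled here.

```python
def node_inputs(i, j):
    """List the (i, j) indices of the nodes whose outputs feed node (i, j)."""
    if j == 0:
        # Encoder node: one input from the previous encoder layer.
        # The top node (0, 0) has no predecessor inside the nested U-Net.
        return [(i - 1, 0)] if i > 0 else []
    # Decoder node: the j earlier nodes on the same skip path ...
    same_path = [(i, k) for k in range(j)]
    # ... plus one upsampled output from the skip path below.
    return same_path + [(i + 1, j - 1)]


for j in range(4):
    print((0, j), "<-", node_inputs(0, j))
```

For `j > 0` the list always has `j + 1` entries, which matches the input count stated in the text.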

Model training
Both the focal loss and the Dice loss are suitable for imbalanced datasets and can improve the stability and effect of model training. Hence, we combine these functions into the final loss function. The focal loss can be calculated as

$$L_{focal} = -\eta_{t}\left(1 - p_{t}\right)^{\rho}\log\left(p_{t}\right)$$

where $p_{t}$ is the degree of closeness between the ground truth $y$ and the prediction $\hat{y}$, $\eta_{t}$ is a weight that controls the contribution of positive and negative samples to the total loss, and $\rho$ is a predefined focusing parameter. The Dice loss can be calculated as

$$L_{dice} = 1 - \frac{2\left|y \cap \hat{y}\right|}{\left|y\right| + \left|\hat{y}\right|}$$

where $y$ is the ground truth, $\hat{y}$ is the prediction of the model, $\left|y \cap \hat{y}\right|$ is the intersection between $y$ and $\hat{y}$, and $\left|y\right| + \left|\hat{y}\right|$ is the sum of the sizes of $y$ and $\hat{y}$.
Extensive studies and experimental results have shown that data augmentation can help improve the effectiveness of model training. For training data augmentation, we applied random horizontal and vertical mirroring of images as well as random rotation and random erasing of pixels (areas of four pixels randomly erased, with a maximum width and height of 50 pixels for the erased areas).
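The combined loss can be sketched in plain Python for a flattened binary mask. This is an illustrative sketch, not the authors' training code: the values `eta=0.25` and `rho=2.0` are commonly used defaults assumed here because the text does not state them, the `smooth` term is an added assumption to avoid division by zero on empty masks, and all function names are hypothetical.

```python
import math


def focal_loss(y_true, y_prob, eta=0.25, rho=2.0, eps=1e-7):
    """Mean binary focal loss: -eta_t * (1 - p_t)^rho * log(p_t)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)       # keep log() finite
        p_t = p if y == 1 else 1.0 - p         # probability of the true class
        eta_t = eta if y == 1 else 1.0 - eta   # class-balancing weight
        total += -eta_t * (1.0 - p_t) ** rho * math.log(p_t)
    return total / len(y_true)


def dice_loss(y_true, y_prob, smooth=1.0):
    """1 - 2|y AND y_hat| / (|y| + |y_hat|), with an assumed smoothing term."""
    inter = sum(y * p for y, p in zip(y_true, y_prob))
    denom = sum(y_true) + sum(y_prob)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def combined_loss(y_true, y_prob, w_focal=1.0, w_dice=1.0):
    """Weighted sum of the focal and Dice losses."""
    return (w_focal * focal_loss(y_true, y_prob)
            + w_dice * dice_loss(y_true, y_prob))


labels = [1, 0, 1, 0]
print(combined_loss(labels, [0.9, 0.1, 0.8, 0.2]))  # mostly correct prediction
print(combined_loss(labels, [0.1, 0.9, 0.2, 0.8]))  # mostly wrong prediction
```

The focusing term `(1 - p_t) ** rho` shrinks the contribution of pixels that are already classified confidently, while the Dice term measures overlap at the mask level, so the two terms respond to class imbalance in complementary ways.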

Model fusion
To increase the change detection accuracy on test remote-sensing images, as shown in Figure 5, we apply four augmentation operations to each original

Experimental results and analysis
The hardware environment for the experiments performed in this study included an NVIDIA A100 graphics card, 40 GB video memory, and 256 GB main memory. The software environment was Ubuntu 18.04, and we used the PyTorch library to implement our method.

Datasets
To verify the effectiveness of the proposed change detection method, we used the LEVIR-CD (learning, vision, and remote sensing change detection) (H. Chen & Shi, 2020) and SYSU-CD (Sun Yat-Sen University change detection) (Q. Shi et al., 2022) large-scale public datasets containing remote sensing images for change detection.
The LEVIR-CD dataset contains 637 remote sensing images of 1024 × 1024 pixels and a resolution of 0.5 m. The dataset includes 31,333 change instances. Following the original splitting method, the numbers of training, validation, and test samples were set to 445, 64, and 128 images, respectively. Due to the video memory limitation, we divided the images into nonoverlapping blocks of 512 × 512 pixels. Thus, the numbers of image pairs used for training, validation, and testing were 1780, 256, and 512, respectively.
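The resulting sample counts follow directly from the tiling: each 1024 × 1024 image yields four non-overlapping 512 × 512 blocks. A minimal sketch of this arithmetic, using a hypothetical helper `tile_origins`:

```python
def tile_origins(size, tile):
    """Top-left corners of non-overlapping tile x tile blocks in a size x size image."""
    return [(r, c) for r in range(0, size, tile) for c in range(0, size, tile)]


tiles_per_image = len(tile_origins(1024, 512))
split_sizes = {"train": 445, "val": 64, "test": 128}
pair_counts = {name: n * tiles_per_image for name, n in split_sizes.items()}
print(tiles_per_image, pair_counts)
```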
The SYSU-CD dataset contains 20,000 aerial remote sensing image pairs of 256 × 256 pixels and a resolution of 0.5 m. Following the original splitting method, the numbers of training, validation, and test samples were set to 12,000, 4000, and 4000, respectively.

Evaluation metrics
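To quantitatively evaluate the performance of different change detection methods, we used the F1-score, precision, recall, intersection over union (IoU), and overall classification accuracy (OA) as evaluation metrics. The evaluation metrics are calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \quad \mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, FN, FP, and TN represent the numbers of true positives, false negatives, false positives, and true negatives, respectively.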
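As a worked example, the five metrics can be computed from the four confusion counts with a few lines of Python. The function name `change_metrics` and the example counts are hypothetical, and the sketch assumes that the denominators are non-zero.

```python
def change_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, IoU, and OA from pixel-level counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    oa = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "iou": iou, "oa": oa}


# Illustrative counts only; they are not results from the paper.
print(change_metrics(tp=80, fp=20, fn=20, tn=880))
```

Because unchanged pixels usually dominate, OA stays high even when many changed pixels are missed, which is why F1 and IoU are the more informative metrics for change detection.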

Parameter settings and analysis
We used the Adam optimizer to train the model over 50 epochs. The training was divided into two stages, with the learning rate set to 0.0001 over the first 30 epochs and 0.00001 over the last 20 epochs.
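The two-stage schedule amounts to a single step decay. The sketch below expresses it as a plain function; the name `learning_rate` and the 0-indexed epoch convention are assumptions made for illustration.

```python
def learning_rate(epoch, total_epochs=50, switch_epoch=30):
    """Step schedule: 1e-4 for epochs 0..29, then 1e-5 for epochs 30..49."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    return 1e-4 if epoch < switch_epoch else 1e-5


print([learning_rate(e) for e in (0, 29, 30, 49)])
```

In PyTorch, a schedule of this shape could be obtained with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[30]` and `gamma=0.1`, although the text does not say how the authors implemented it.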
As we combined the focal and Dice loss functions, their weights directly affect the change detection accuracy. To study the influence of the two loss functions on the change detection results, we fixed the weight of the Dice loss to 1.0 and evaluated focal loss weights of 0.5, 1.0, 2.0, and 4.0. The F1 scores obtained on the LEVIR-CD and SYSU-CD datasets are shown in Figure 6. When the focal loss weight is relatively large (e.g., 2.0 or 4.0), the change detection accuracy decreases. When the focal loss weight is 0.5, the detection accuracy on the LEVIR-CD dataset slightly improves, while that on the SYSU-CD dataset decreases. Therefore, we set the weights of both loss functions to 1.0, which provides balanced performance on both datasets.

Comparative analysis
To verify the effectiveness of the proposed method, we compared it with the conventional DeepLabV3, U-Net, PSPNet, and UpperNet for change detection in remote sensing images. In addition, we implemented the latest MSPSNet (Huang et al., 2021), ISNet (G. Cheng et al., 2022), BIT (H. Chen et al., 2022), and ChangeFormer (Bandara & Patel, 2022) using their publicly available codes. For a fair comparison, the training, validation, and test sets used for the proposed and comparison methods were the same.
The change detection results of the compared methods on the LEVIR-CD and SYSU-CD datasets are listed in Tables 1 and 2, respectively. For a broader comparison, we included the change detection results of the proposed method without model fusion. For the LEVIR-CD dataset, the F1-score, precision, recall, and IoU of the proposed method without model fusion are better than those of the other methods. The precision of our proposal without model fusion is higher than that of the similar methods BIT and ChangeFormer. Hence, the proposed method improves the change detection accuracy. With model fusion, the proposed method achieves the highest values in the four evaluation metrics other than OA, further confirming the effectiveness of the proposed fusion strategy. For the SYSU-CD dataset, the F1-score, recall, and IoU of the proposed method with model fusion achieve the highest values, while the precision is lower than that of BIT, and the OA is slightly lower than that of ChangeFormer. In general, the proposed method with model fusion achieves the most balanced detection results, demonstrating its effectiveness and high performance.
To show the results of different methods, we randomly selected five image pairs from the LEVIR-CD and SYSU-CD datasets for change detection visualization, as shown in Figures 7 and 8, respectively. The figures also show the image pairs and the corresponding ground truths. Panels (d)-(k) show the change detection results of DeepLabV3, U-Net, PSPNet, UpperNet, MSPSNet, BIT, ChangeFormer, and the proposed method with model fusion, respectively. The white area is the correct detection area (TP). The black area is the background area. The red area is the false detection area (FP), and the blue area is the missed detection area (FN). In the second row of Figure 7, the comparison methods show false change detection areas, whereas the proposed method is more accurate. In the fifth row, the comparison methods miss many change areas, whereas the proposed method provides a more complete detection. In Figure 8, the proposed method has fewer false detection and missed detection areas than the comparison methods, further demonstrating the effectiveness of our proposal.

Ablation study
To explore the impact of different data augmentation operations and model fusion on the change detection accuracy, we conducted ablation experiments on the two datasets. The experimental results listed in Table 3 show that the change detection accuracy obtained without data augmentation is the lowest. The two data augmentation operations of horizontal and vertical mirroring and random rotation slightly improve the change detection accuracy. In contrast, random erasing greatly improves the change detection accuracy, by more than 0.5% on both datasets. Random erasing fills a certain area in the image with the same pixel value, thus covering the image information of that area. This forces the model to learn the features outside the area for recognition and, to some extent, prevents the model from falling into a local optimum, thus improving its generalization ability. On top of data augmentation, the proposed model fusion method further improves the change detection accuracy by 0.71% for the LEVIR-CD dataset and by 0.32% for the SYSU-CD dataset. This illustrates the necessity of using data augmentation to train change detection models and demonstrates the effectiveness of model fusion.
To verify the effectiveness of the deep hierarchical transformer, we tested three models with different scales: Swin-tiny, Swin-small, and Swin-base. The network structures of the three models are shown in Table 4. We used ResNet34, ResNet50, and ResNet101 as baselines and applied the model fusion strategy designed in this paper to each model. We also implemented the nested U-Net without a feature extraction backbone network. Tables 5 and 6 show the detection results of the different models on the two datasets. First, the detection accuracy of the models using a feature extraction backbone network is higher than that of the nested U-Net alone. A feature extraction backbone network, such as ResNet or Swin, can extract more abstract features of different scales from multitemporal remote sensing images. These features are embedded in the nested U-Net to make better use of multi-scale features. Therefore, the nested U-Net with a feature extraction backbone network improves the change detection accuracy. Second, compared with ResNet, the deep hierarchical transformer used in this paper significantly improves the detection accuracy, which supports the effectiveness of the proposed design. However, as the number of parameters of the deep hierarchical transformer increases, the detection accuracy shows a downward trend, so we finally use the Swin-tiny model as the backbone network for feature extraction. Third, comparing the accuracy of the different models before and after applying the fusion method, we find that the model fusion method improves the change detection accuracy for all of them, which demonstrates its effectiveness. Finally, we also designed a strategy to fuse the detection results of the three models. The experimental results on the two datasets are also shown in Tables 5 and 6. The recall improves greatly after fusing the three models, but the corresponding precision decreases significantly, and the F1 score does not improve. Considering that training three models takes more time and that the accuracy improvement is not obvious, this paper does not adopt such a multi-model fusion strategy.

Execution efficiency analysis
To analyze the execution efficiency of the proposed method, we compared our method with two transformer-based comparison methods (BIT and ChangeFormer). Table 7 reports the number of parameters (Params), floating-point operations (FLOPs), training time per epoch (Training), and testing time (Testing) of the different methods on both datasets. The Params and FLOPs of the proposed method lie between those of BIT and ChangeFormer. Notably, the two comparison methods are trained according to their open-source code with a batch size of 8 or 16, while the proposed method has been optimized to use a larger batch size during training; therefore, the training time of the proposed method is shorter. The three methods employ the same batch size throughout the model testing phase, but the proposed method includes the model fusion strategy, which slightly increases the testing time. In general, the proposed method does not significantly increase the number of parameters and the complexity of the model while ensuring the accuracy of change detection.

Conclusions
We propose a hierarchical transformer to improve change detection in high-resolution remote sensing images. Quantitative and qualitative experimental results on two change detection datasets, LEVIR-CD and SYSU-CD, show that the proposed hierarchical transformer for feature extraction combined with a nested U-Net can outperform conventional methods. To further improve the change detection accuracy, we adopt a model fusion method. The experimental results verify that fusion increases the change detection accuracy.

Wu, H., Zheng, J., Qi, K., & Liu, W. (2021). A hierarchical self-attention augmented Laplacian pyramid expanding network for change detection in high-resolution remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 182, 52-66. https://

Figure 3. Data processing in various window self-attention mechanisms.

Figure 4. Diagram of nested U-Net for change detection.

Figure 5. Diagram of model fusion for change detection.
test image: horizontal mirroring, vertical mirroring, 90° rotation, and 270° rotation. The original image pair and the augmented image pairs are fed into the trained model separately to obtain five change detection results. Then, we undo the mirroring and rotation operations on the change detection results of the augmented images. The final change detection result is obtained by a voting rule on the five change detection results at every position. If two or more of the five change detection results predict a variation, the result at this position reflects a change.
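The fusion procedure described above (augment, predict, undo the augmentation, vote) can be sketched on nested lists. This is a minimal illustration under stated assumptions: `predict` stands for any hypothetical callable that maps an image pair to a binary mask, `toy_predict` is a stand-in that simply marks differing pixels, and images are plain lists of rows rather than tensors.

```python
def hflip(img):
    """Mirror horizontally (reverse each row)."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror vertically (reverse the row order)."""
    return [list(row) for row in img[::-1]]


def rot90(img):
    """Rotate 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]


def rot270(img):
    """Rotate 270 degrees counter-clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def identity(img):
    return img


# (forward transform, inverse transform) pairs; identity is the original pair.
AUGMENTATIONS = [(identity, identity), (hflip, hflip), (vflip, vflip),
                 (rot90, rot270), (rot270, rot90)]


def fuse_predictions(predict, image_a, image_b, min_votes=2):
    """Vote over the five predictions after undoing each augmentation."""
    votes = None
    for forward, inverse in AUGMENTATIONS:
        mask = inverse(predict(forward(image_a), forward(image_b)))
        if votes is None:
            votes = [[0] * len(row) for row in mask]
        for r, row in enumerate(mask):
            for c, value in enumerate(row):
                votes[r][c] += value
    # A position is marked as changed if at least min_votes results agree.
    return [[1 if v >= min_votes else 0 for v in row] for row in votes]


def toy_predict(a, b):
    """Hypothetical stand-in for the trained model: flag differing pixels."""
    return [[1 if x != y else 0 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


before = [[0, 0, 0], [0, 0, 0]]
after = [[0, 1, 0], [0, 0, 0]]
print(fuse_predictions(toy_predict, before, after))
```

With a threshold of two votes out of five, a position is kept as changed even when most augmented views miss it, so the rule favors recall over precision.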

Figure 6. Effect of focal loss weight on change detection accuracy.