Change detection using multi-scale convolutional feature maps of bi-temporal satellite high-resolution images

ABSTRACT Change detection in high-resolution satellite images is essential to understanding the land surface (e.g. agriculture and urban change) or maritime surface (e.g. oil spills). Many deep-learning-based change detection methods have been proposed to improve on classical techniques. However, the massive volume of satellite imagery and the scarcity of ground-truth images remain challenging concerns. In this paper, we propose a supervised deep network for change detection in bi-temporal remote sensing images. We feed multi-level features from the convolutional networks of two images (feature-extraction) into one architecture (feature-difference) to capture better shape and texture properties using a dual attention module. We also utilize a multi-scale dice coefficient error function to decrease the overlap between changed and background pixels. The network is applied to public datasets (ACD, SYSU-CD and OSCD). We compare the proposed architecture with various attention modules and loss functions to verify the performance of the proposed method. We also compare the proposed method with state-of-the-art methods in terms of three metrics: precision, recall and F1-score. The experimental outcomes confirm that the proposed method performs well compared to benchmark methods.


Introduction
Change detection is a process to identify disparities in the state of an object from different images of the same area at different times. Monitoring differences has been widely applied for various applications such as urban expansion, vegetation mapping, sea ice, surface water, disaster assessment, planetary surfaces, etc. (Chen et al. 2019; Parente et al. 2019; Kaiyu et al. 2020; Chen et al. 2020; Mohsenifar et al. 2021; Zhao et al. 2022). Satellite images have been widely used to observe variations in shape and texture properties. However, change detection is still a challenging problem because of the massive amount of digital Earth observations, which vary in spatial resolution from kilometers to centimeters across all kinds of satellite sensors such as Landsat, WorldView and DeepGlobal. Also, many remote sensing studies suffer from the unavailability of labeled observations to train efficient machine learning models.
Change detection methods are categorized into two approaches. The first approach is pixel-based, which compares corresponding pixels from multi-temporal images to produce change maps based on arithmetic operations (e.g. image difference, image ratio) or transformation operations (e.g. principal component analysis, canonical correlation analysis and change vector analysis) (Hussain et al. 2013; Liu et al. 2019). However, pixel-based methods neglect spatial contextual information and separate changed pixels from unchanged pixels in an unsupervised manner. The second approach is patch-based, which derives features from patches or segmentation maps. However, patch-based methods designed for low- or middle-resolution images fail to work on high-resolution images because of the variability of image objects. Classical patch-based methods mainly apply traditional machine-learning techniques (e.g. support vector machines, clustering, kernel regression) after extracting hand-crafted features (Dengkui et al. 2008; Celik 2009; Luppino et al. 2018). Recent patch-based methods are based on deep learning techniques such as deep belief networks and autoencoders (Zhang et al. 2016; Lei et al. 2019; Rostami et al. 2019).
Previous studies used unsupervised methods based on transformation or arithmetic operations or on unsupervised machine learning. Recently, many works use supervised change detection methods (Peng, Zhang, and Guan 2019; Sherrie et al. 2020; Zhang et al. 2020; Chen et al. 2022). Supervised methods have many advantages. First, a training process based on massive labeled data helps to create a robust model, in particular for the class-imbalance problem that characterizes change detection in satellite images, where the number of changed pixels is very small compared to unchanged pixels, and for detecting changes in fine image details and complex texture features in high-resolution images (Zhang et al. 2020). In convolutional neural networks, supervision improves the ability to learn multi-scale features from raw input images based on labeled image samples (Peng, Zhang, and Guan 2019; Zhang et al. 2020; Kaiyu et al. 2020), and it allows introducing a change-detection loss in intermediate layers (Zhang et al. 2020). Second, supervised learning produces good model performance with higher evaluation scores (specificity, sensitivity, precision, etc.) (Goswami et al. 2022).
Nowadays, supervised deep networks are applied using two approaches. First, a single network architecture is used to extract multi-scale features based on arithmetic operations between the two bi-temporal images (Daudt, Bertr, et al., 2018). Second, two parallel network architectures are used to extract multi-scale features from each image (Daudt, Bertr, et al., 2018; Chen et al. 2020; Zhang et al. 2020). In this paper, we use a supervised patch-based deep method. The method has three parts: two parts to extract features of two sequential image patches and one part to differentiate between changed and unchanged image patches in high-resolution images. The contributions of this paper are summarized as follows:
• It uses an end-to-end architecture: an encoder to extract multi-scale features from two sequential images and a decoder to differentiate between learned features.
• It integrates feature maps from the same convolutional layer into dual attention maps (DAM) that concentrate on the spatial and channel differences of combined feature maps.
• It uses the Dice coefficient as an error function between multi-scale predicted probability change maps and multi-scale reference change maps.
The rest of this paper is organized as follows. Section 2 presents some of the related works. Section 3 describes the proposed method. Section 4 shows the experimental results and important findings. Section 5 summarizes this paper.

Related work and problem definition
There are three deep approaches for detecting changes from satellite images: early fusion, late fusion and the combination of early and late fusion. Fully convolutional-early fusion (FCEF) is one of the benchmark deep change-detection methods; it fuses the bi-temporal images early. It shares low-level features using skip-connections but fails to provide details of the individual raw images. Mainly, the output change-detection maps have irregular object boundaries and lower object compactness (Daudt, Bertr, et al., 2018). Caye Daudt et al. (2018) proposed fully convolutional Siamese-concatenation (FCSC) and fully convolutional Siamese-difference (FCSD), which address the weakness of the previous method. Both methods first apply a Siamese encoding stream to extract deep features from bi-temporal images and then combine the extracted deep features in the decoding stream to produce a change detection map. The difference is that the FCSD is based on the difference between in-depth features from the encoding stream, whereas the FCSC depends on the concatenation of in-depth features from the encoding stream. Back-propagation is performed from the feature-discrimination/difference layers (decoder) to the feature-extraction layers (encoder). Chen et al. (2020) proposed a deep Siamese multi-scale convolutional network (DSMSCN) architecture using multi-scale feature convolution unit (MFCU) layers to extract multi-scale spatial and spectral features from raw images before the feature-difference stage. These methods may produce uninformative features and poor image quality. To fix the problem, many studies concatenate raw-image features and image-difference features; however, the main concern is how to combine features effectively. Zhang et al. (2020), one of the latest studies, used the image difference feature (IDF), dual attention module (DAM), spatial attention module (SAM) and channel attention module (CAM) to integrate the raw-image features (encoder stream) into the feature-difference stream (decoder).
In this work, we extract multi-scale features from convolutional layers (encoder) into feature-difference layers (decoder) to acquire better change maps with accurate structures, capturing variations from the pixel level (e.g. intensity) through the region level (e.g. shape and texture) to the object level. We also use a dual attention module based on spatial and channel modules. We add the difference-feature and the difference-image to improve the output of the feature-difference stage.

Method
The network architecture and loss functions are presented in Section 3.1 and Section 3.2.

Network architecture
The network architecture has three branches, I, II and III, as shown in Figure 1. Branch I presents the network architecture (encoder I) of the first input image X_T0 (dimensions W × H × C) at time T0, where W, H and C are width, height and number of channels. Branch II presents the network architecture (encoder II) of the second image X_T1 (W × H × C) at time T1. Both branches form the feature-extraction stage (Figure 1-a) from X_T0 and X_T1. The convolutional feature maps of branches I and II are fed to branch III, which presents the network architecture of the decoder. The first input of branch III is the feature maps of the latest convolutional layers of branches I and II. Branches I and II present the feature-extraction stage (Figure 1-a) and branch III presents the feature-difference stage (Figure 1-b). The feature maps from the same convolutional layers of branches I and II are combined into one dual attention module (DAM) to produce multiple outputs in branch III. Each DAM consists of a spatial attention module (SAM) and a channel attention module (CAM) (Jun et al., 2019). The final binary change-map is the result of aggregating all DAM outputs after upsampling to the scale of the input images X_T0 and X_T1. In Sections 3.1.1, 3.1.2 and 3.1.3, we illustrate all processes to produce the SAM, CAM and DAM outputs.
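The data flow above can be sketched in a few lines of numpy. This is a toy shape-level illustration only, not the actual ResNet encoders or DAM: the encoder is stood in by random feature maps, the per-level DAM output is replaced by an absolute-difference map, and nearest-neighbour upsampling and a mean-based threshold are assumptions for brevity.

```python
import numpy as np

def encoder(x, n_levels=3):
    """Toy stand-in for one encoder branch: each level halves the
    spatial size and doubles the channels (random values, shapes only)."""
    feats = []
    h, w, c = x.shape
    for _ in range(n_levels):
        h, w, c = h // 2, w // 2, c * 2
        feats.append(np.random.rand(h, w, c))
    return feats

def upsample(m, size):
    """Nearest-neighbour upsampling of a 2-D map to the input image scale."""
    fh, fw = size[0] // m.shape[0], size[1] // m.shape[1]
    return np.kron(m, np.ones((fh, fw)))

def change_map(x_t0, x_t1):
    f0, f1 = encoder(x_t0), encoder(x_t1)          # branches I and II
    size = x_t0.shape[:2]
    maps = []
    for a, b in zip(f0, f1):                       # branch III, one map per level
        diff = np.abs(a - b).mean(axis=-1)         # placeholder for a DAM output
        maps.append(upsample(diff, size))          # back to input-image scale
    fused = np.mean(maps, axis=0)                  # aggregate multi-scale outputs
    return (fused > fused.mean()).astype(np.uint8) # binary change map

y = change_map(np.random.rand(64, 64, 3), np.random.rand(64, 64, 3))
```

The essential point is structural: every encoder level contributes one upsampled map to the final aggregation, rather than only the deepest layer.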

Spatial Attention Module (SAM)
We use the SAM to increase the distance between changed and unchanged pixels in difference-maps of the convolutional feature maps of branches I and II (Figure 2). The input map to the SAM is M_conv (W × H × C), a combination of three random difference-maps between feature-maps of the same convolutional layers (M_conv,T0 and M_conv,T1). To produce the SAM map M_sam, M_conv is fed through pooling, summation and multiplication operations. First, it is fed into maximum-pooling and average-pooling operations to produce M_max (W × H × 1) and M_avg (W × H × 1). Second, the avg-matrix and max-matrix are combined into M_i (W × H × 2). Third, M_i is fed into a convolution operation along the channel axis to produce M_ii (W × H × 1) (Eq. 1), which is activated using the sigmoid function to generate M_sam (Eq. 2). Finally, the SAM output M_sam is element-wise multiplied with M_conv to refine the feature maps: changed pixels are emphasized by multiplication with higher weights, while unchanged pixels are suppressed by multiplication with lower weights.

M_ii = C(M_avg ⊕ M_max) (1)
M_sam = σ(M_ii) (2)

where C is a convolution operation with filter size 5 × 5, ⊕ is an element-wise summation, σ is the sigmoid function and ⊗ is an element-wise multiplication.
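The SAM steps above can be sketched as follows. This is a minimal illustration, not the trained module: the learned 5 × 5 convolution is approximated by a random 1 × 1 projection over the two pooled channels, and channel-wise max/mean stand in for the pooling layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(m_conv):
    """Sketch of the SAM: pool over channels, combine, project to one
    channel, squash with a sigmoid, and reweight the input map."""
    m_max = m_conv.max(axis=-1, keepdims=True)    # W x H x 1 max-pooling
    m_avg = m_conv.mean(axis=-1, keepdims=True)   # W x H x 1 average-pooling
    m_i = np.concatenate([m_avg, m_max], axis=-1) # W x H x 2 combined maps
    w = np.random.rand(2, 1)                      # stand-in for the 5x5 conv weights
    m_ii = m_i @ w                                # W x H x 1 (Eq. 1)
    m_sam = sigmoid(m_ii)                         # spatial attention weights (Eq. 2)
    return m_sam * m_conv                         # refined feature map
```

Because the sigmoid output lies in (0, 1), every pixel of the refined map is a damped copy of the input, with larger weights where the pooled difference responses are strong.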

Channel Attention Module (CAM)
We use the CAM to emphasize target-relevant channels while suppressing target-irrelevant channels (Figure 3). The input to the CAM is the same as the input to the SAM, M_conv. To produce the CAM output M_cam, M_conv is fed through reshape, transpose, summation and multiplication operations as follows.
where ⊗ is a matrix multiplication operation and ⊕ is an element-wise sum operation.
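The reshape/transpose/softmax/multiplication pipeline of the CAM (detailed with Figure 3) can be sketched directly in numpy. This follows the description in this section; the axis of the softmax is an assumption, and no learned weights are involved.

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(m_conv):
    """Sketch of the CAM: build a C x C channel-affinity map, reweight
    the channels with it, and add the result back to the input."""
    w, h, c = m_conv.shape
    n = w * h
    m_i = m_conv.reshape(n, c)           # reshape: N x C, N = W * H
    m_ii = m_i.T                         # transpose: C x N
    m_c = softmax(m_ii @ m_i, axis=-1)   # C x C channel-affinity map (Eq. 3)
    m_cc = (m_i @ m_c).reshape(w, h, c)  # channels reweighted by affinities (Eq. 4)
    return m_conv + m_cc                 # element-wise sum with the input (Eq. 5)
```

Each row of the affinity map sums to one, so a channel's output is the input channel plus a convex mixture of all channels weighted by how strongly they co-activate.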

Dual Attention Module (DAM)
The DAM output is the result of an element-wise sum of M_sam and M_cam after multiplication with the last difference-maps of the feature convolution layers (M_conv,F−1) of branches I and II, where F is the number of feature maps in one convolutional layer.
The final binary change-map Ŷ (Eq. 7) is the result of a maximum operation over all M_dam, element-wise multiplied by the difference-image between the input images X_T0 and X_T1 of branches I and II.
where C and σ are a 1 × 1 convolution and the sigmoid function, respectively.
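The fusion in Eq. 7 can be sketched as below. This is an illustration of the max-and-multiply structure only: the learned 1 × 1 convolution is omitted and the final threshold (here, the mean probability) is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def final_change_map(dam_maps, x_t0, x_t1):
    """Sketch of Eq. 7: element-wise maximum over the (upsampled) DAM
    outputs, multiplied by the difference-image, then squashed and
    thresholded into a binary change map."""
    diff_img = np.abs(x_t0 - x_t1).mean(axis=-1)  # difference-image, W x H
    fused = np.maximum.reduce(dam_maps)           # max over all DAM outputs
    prob = sigmoid(fused * diff_img)              # probability change map
    return (prob > prob.mean()).astype(np.uint8)  # stand-in for the learned threshold

dams = [np.random.rand(16, 16) for _ in range(3)]
cm = final_change_map(dams, np.random.rand(16, 16, 3),
                      np.random.rand(16, 16, 3))
```

Multiplying by the difference-image suppresses locations where the raw bi-temporal pixels already agree, regardless of how strongly the attention maps fire there.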

Loss function
We use a multi-scale error function between the predicted probability change-map Ŷ_l from each layer l of branch III and its corresponding reference change-map Y_l of the same dimension. We use the binary cross-entropy function (Eq. 8) and the weighted dice-coefficient function (Eq. 9). The binary cross-entropy function measures the error between each pixel in the ground-truth change-map Y and its corresponding pixel in the predicted probability change-map Ŷ (Eq. 8).
where y represents the ground-truth value of the pixel; y = 1 if the ground-truth pixel belongs to the changed class, otherwise y = 0. ŷ represents the predicted probability of the pixel belonging to the changed class, and 1 − ŷ the probability of the pixel belonging to the unchanged class. N is the number of image samples.
We use the weighted dice-coefficient function because it is effective in the class-imbalance scenario, which is the case in the change-detection problem: the number of changed pixels is very small compared to the unchanged pixels. Here L is the number of layers in branch III, and y and ŷ denote each pixel in the ground-truth change-map and the probability change-map, respectively. The total loss of the network is a combination of the two functions:

E = E_C(Y, Ŷ) + E_D,L(Y, Ŷ) (10)
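The two loss terms can be sketched as follows. This is a minimal unweighted version for illustration: the smoothing constant and the equal weighting of the cross-entropy and dice terms in the total loss are assumptions.

```python
import numpy as np

def bce(y, y_hat, eps=1e-7):
    """Binary cross-entropy averaged over the pixels (Eq. 8)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def dice_loss(y, y_hat, eps=1e-7):
    """1 - Dice coefficient for a single prediction scale."""
    inter = (y * y_hat).sum()
    return 1 - (2 * inter + eps) / (y.sum() + y_hat.sum() + eps)

def total_loss(y_scales, y_hat_scales):
    """Sketch of Eq. 10: cross-entropy on the final map plus the sum
    of dice losses over all L decoder scales."""
    e_c = bce(y_scales[-1], y_hat_scales[-1])
    e_d = sum(dice_loss(y, p) for y, p in zip(y_scales, y_hat_scales))
    return e_c + e_d
```

Because the dice term is a ratio of overlap to total mass, a few changed pixels cannot be drowned out by the unchanged majority, which is the stated motivation for using it under class imbalance.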

Performance
In this section, we describe the satellite datasets used in Section 4.1 and the evaluation metrics and training parameter settings in Section 4.2. We evaluate the network design and experimental results in Section 4.4 and Section 4.5.

Onera Satellite Change Detection-OSCD
The OSCD dataset comprises 24 pairs of multi-spectral images from the Sentinel-2 satellites acquired between 2015 and 2018 (Daudt, Bertr, et al., 2018). The image pairs are picked worldwide, in Brazil, the USA, Europe, the Middle East and Asia. Each image consists of 13 bands, with spatial resolutions of 10, 20 and 60 meters per pixel. We use 20,000 patches of size 256 × 256 (12,000, 3000 and 5000 pairs as training, validation and testing images, respectively). The annotated changes focus on urban changes, such as new buildings or new roads. This dataset can be downloaded from https://rcdaudt.github.io/oscd/.

Evaluation metrics
To evaluate the performance, we measure precision (positive predictive value, PPV) (Eq. 11), recall or sensitivity (true positive rate, TPR) (Eq. 12), specificity (true negative rate, TNR) (Eq. 13) and F1-score (Eq. 14) after testing five times on all test sets, where TP, FN, FP and TN are the number of changed pixels correctly classified as changed, the number of changed pixels classified as unchanged, the number of unchanged pixels classified as changed, and the number of unchanged pixels classified correctly, respectively (Maxwell et al. 2021). We also evaluate the network design (attention modules and loss functions) based on the average intersection-over-union (IoU). The IoU is defined as the area of intersection of the predicted change map Ŷ with the ground-truth map Y divided by the area of the union between Ŷ and Y (Eq. 15).
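All five metrics follow directly from the four confusion counts, as the short helper below shows (standard definitions of Eqs. 11-15; no parameters beyond the counts themselves).

```python
def cd_metrics(tp, fn, fp, tn):
    """Precision, recall, specificity, F1 and IoU from pixel-level
    confusion counts (Eqs. 11-15)."""
    precision = tp / (tp + fp)                            # PPV
    recall = tp / (tp + fn)                               # TPR / sensitivity
    specificity = tn / (tn + fp)                          # TNR
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean
    iou = tp / (tp + fp + fn)                             # overlap of changed pixels
    return precision, recall, specificity, f1, iou
```

Note that TN appears only in specificity; precision, recall, F1 and IoU all ignore the (dominant) unchanged-unchanged pixels, which is why they are preferred under class imbalance.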
We compare the proposed method with the following state-of-the-art methods:
• FCEF (Daudt, Le Saux, et al., 2018): the first step of this network is image fusion; the two images at T0 and T1 are concatenated into one input image for the network.
• FCSD (Daudt, Bertr, et al., 2018): two parallel Siamese network streams are used to extract features from the input images at T0 and T1 (encoder). The output maps of the second stream are subtracted from the output maps of the first stream to produce inputs to the third network stream (decoder). The output map of the third stream is the probability change map.
• FC-Siam-Con (Daudt, Bertr, et al., 2018): similar to the FCSD, it uses two parallel Siamese network streams. However, the output maps of the second stream are concatenated with the output maps of the first stream to produce input maps for the third stream (decoder).

Training and parameter setting
The first step in detecting changes in satellite images is a pre-processing stage including radiometric normalization (e.g. IRMAD (Canty and Nielsen 2008) and keypoint-based RRN (Moghimi et al. 2021, 2022)) and co-registration. The ACD and SYSU-CD image pairs are already radiometrically normalized with zero mean and unit variance (Chen et al. 2020). The OSCD image pairs are radiometrically normalized and co-registered using the GEFolki toolbox (Brigot et al. 2016; Daudt, Bertr, et al., 2018). We use a ResNet architecture for the parallel branches I and II. We train the model for 5000 epochs with a batch size of 16. We use the Adam optimizer (Kingma and Jimmy, 2015). The learning rate is initialized at 0.0001 and is multiplied by the learning-rate decay, empirically set to 0.2, if the loss stops decreasing for 10 epochs. We use Keras 2.2 with TensorFlow 1.9 on a high-performance computing (HPC) server with Nvidia Tesla V100 GPUs to run all experiments. For the FCEF, FC-Siam-Diff, FC-Siam-Con, DSMSCN, NestNet2 and DSIFN architectures, we use the same parameter settings as in Daudt, Bertr, et al. (2018), Chen et al. (2020), Li, Li, and Fang (2020) and Zhang et al. (2020).
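The learning-rate schedule described above (initial rate 1e-4, multiplied by 0.2 after 10 epochs without improvement) can be expressed as a small plateau-decay rule. This sketch mirrors Keras's ReduceLROnPlateau behaviour in plain Python; the strict-improvement comparison and the reset of the patience counter after a decay are assumptions.

```python
def lr_schedule(losses, lr0=1e-4, decay=0.2, patience=10):
    """Per-epoch learning rates: start at lr0 and multiply by `decay`
    whenever the loss has not improved for `patience` epochs."""
    lr, best, wait = lr0, float("inf"), 0
    history = []
    for loss in losses:
        if loss < best:
            best, wait = loss, 0      # new best loss: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                lr *= decay           # decay and restart the wait
                wait = 0
        history.append(lr)
    return history

# one early improvement followed by a long plateau triggers one decay step
hist = lr_schedule([1.0] + [0.5] * 12)
```

In the Keras setup the paper describes, the equivalent callback would be `ReduceLROnPlateau(factor=0.2, patience=10)` attached to the training loop.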

Ablation study for attention modules and loss function
To verify the performance of the attention modules and loss functions, we conduct experiments with different settings on the SYSU-CD dataset, as shown in Tables 1, 2 and 3. We build various attention architectures with and without the difference-map, and with and without the multi-scale dice-coefficient error function. We use the TPR and TNR metrics because the IoU may be biased.

Dual attention module maps with difference-images vs. loss functions
The network architecture based on the mean of the DAM maps element-wise multiplied with the difference-image (M̄(M_dam) ⊗ M_D(T0,T1)) improves the performance remarkably, with approximately 3% IoU, 3% TPR and 6% TNR gains compared to the architecture with only the last DAM map element-wise multiplied with the difference-image (M_dam ⊗ M_D(T0,T1)), as shown in Table 1.
Employing the multi-scale dice-coefficient function in addition to the binary cross-entropy function (E_C(Y, Ŷ) + E_D,L(Y, Ŷ)) (1st row) enhances the performance because it reduces the overlapping rate between hierarchical structures in binary change maps, with around 3-8% IoU, 2-11% TPR and 3-12% TNR improvements compared to the other error functions: binary cross-entropy error plus dice-coefficient error of the last layer (E_C(Y, Ŷ) + E_D(Y, Ŷ)) (2nd row), sum of the multi-scale dice-coefficient errors (E_D,L(Y, Ŷ)) (3rd row), dice-coefficient error of the last layer (E_D(Y, Ŷ)) (4th row) and binary cross-entropy of the last layer (E_C(Y, Ŷ)) (5th row).

Dual attention module maps vs. loss functions
Using the network architecture based on the mean of all DAM maps (M̄(M_dam), as shown in Table 2) yields higher accuracy scores than using only the DAM map of the last layer with the difference-image (M_dam ⊗ M_D(T0,T1), as shown in Table 1), with maximum score differences of 3% IoU, 6% TPR and 5% TNR. In addition, the mean of the multi-scale DAM maps (M̄(M_dam)) enhances the accuracy scores compared to the final DAM map (M_dam) by at most 3% IoU, 2% TPR and 3% TNR (Table 2).
It is also worth mentioning that using the sum of the binary cross-entropy and multi-scale dice-coefficient functions (E_C(Y, Ŷ) + E_D,L(Y, Ŷ)) brings great benefits for detecting changes between two images when using M_dam or M̄(M_dam) in Tables 1 and 2.

DSIFN attention maps vs. dice loss functions
We also verify the performance of the network by adding the M_cam, M_sam and M_idf cited in Zhang et al. (2020), instead of the M_cam used here, with and without the multi-scale dice-coefficient function. In Zhang et al. (2020), the raw-image features M_conv,T0 and M_conv,T1, together with the image difference feature M_idf of the previous layer (in the decoder part), form the input to the CAM. To produce M_cam, the input is fed into a multi-layer perceptron (MLP) operation after average-pooling and maximum-pooling of the input map. As expected, the multi-scale dice-coefficient function (1st and 3rd rows) improves the IoU, TPR and TNR scores by 1-4% compared to the single-scale function (2nd and 4th rows). Moreover, deriving M_idf, the difference-map of two convolution maps from the same layers at times T0 and T1, element-wise multiplied with the M_cam used here and then with M_sam (3rd and 4th rows), yields higher scores than using the M_cam of Zhang et al. (2020) (1st and 2nd rows), with around 1-2% improvements.

Comparison between the proposed method and benchmark methods
We compare the results of the proposed method with the previous techniques reported in the literature (Daudt, Le Saux, et al., 2018; Daudt, Bertr, et al., 2018; Chen et al. 2020; Li, Li, and Fang 2020; Zhang et al. 2020) based on visual interpretation and quantitative assessment on the ACD, SYSU-CD and Onera datasets. For quantitative assessment, we use precision, recall and F1-score.

ACD-Szada
The ACD-Szada dataset mainly consists of open-area images, in which differences are usually easier to identify. Figure 4 shows the RGB images at T0 and T1 and the binary change maps after applying the benchmark methods, where TP pixels (changed pixels classified correctly), FN pixels (changed pixels classified as background) and FP pixels (unchanged pixels classified incorrectly) are shown in yellow, red and green, respectively. All benchmark methods succeed in identifying the general changed areas in the Szada images (Figure 4). However, the NestNet2 (Figure 4-g) mainly misses all multi-scale structures.
The DSIFN (Figure 4-h) classifies changed pixels incorrectly, producing inaccurate properties for many changed structures, although it employs multi-scale attention modules. In Figure 4-i, the proposed method distinguishes almost all changed small structures, including entire boundaries and continuous lines. This is because the change detection architecture depends on hierarchical features from different convolutional layers that maintain low-level and high-level details, and the error function combines the errors from various multi-scale convolutional layers. On the other hand, it does not succeed in retrieving completely changed structures because of the higher noise level.
Table 4 shows the quantitative assessment of the proposed method compared to the benchmark methods on the Szada dataset. The proposed method achieves the highest scores, with a precision of 64.57 ± 2.2%, recall of 74.88 ± 1.8% and F1-score of 70.12 ± 2.0%. The precision is relatively small because of the noise embedded between changed and unchanged pixels, which produces high values in the difference-image and consequently more FP cases (green regions in Figure 4-i). Although the NestNet2 concentrates on the variations between the convolutional feature maps, it has few true-positive cases and consequently the lowest precision, recall and F1 scores. Compared to CD methods that depend on the early/late fusion of convolutional feature maps (e.g. FCEF, FCSD and FCSC), the proposed method improves precision, recall and F1-score by around 11-14%, 10-12% and 18-20%, respectively, because it depends on both the difference-image (M_D(T0,T1)) and attention modules at the pixel-to-pixel level (M_sam) and the channel-to-channel level (M_cam) from multi-scale convolution layers. Also, it uses a multi-scale dice coefficient loss function to better capture the overlap between changed and unchanged areas, from smaller-scale to larger-scale structures. The DSIFN also uses the difference-image and attention modules from multiple convolutional layers; however, it combines the binary cross-entropy error and dice coefficient error only at the final layer of the feature-difference stage, unlike the proposed method, which applies it across different layers. Therefore, it has more FP cases and consequently a lower precision (47.13 ± 2.4%). It is also worth mentioning that adding a difference-image to the change probability maps improves the recall scores (e.g. DSMSCN, DSIFN and the proposed method reach 67.53 ± 2.8%, 65.01 ± 2.6% and 74.88 ± 1.8%, respectively). We also compare the average inference time over all testing images. All methods take nearly identical time to produce binary change maps, but the proposed method spends the second-shortest time to predict the binary change map, with a lower standard deviation.

ACD-Tisza
We use the previous model and retrain it on the Tisza dataset (fine-tuning), which consists of open-area images similar to the previous dataset. Figure 5 shows the binary change maps after applying all benchmark methods. All methods mainly restore large-size structures (changes in vegetation, new buildings). However, many methods suffer from incomplete detection of curved boundaries (Figure 5-c-f). The DSIFN (Figure 5-h) and the FCSC (Figure 5-e) restore additional structures which are not part of the changed regions because they are based on concentrated features yielding high values in the binary change map. Although the DSIFN (Figure 5-h) performs well in identifying small-scale structures, it does not pinpoint the main structures. On the other hand, the NestNet2 (Figure 5-g) misses all changes, similar to the previous results. A subjective visual comparison with the other CD methods shows that the proposed method works best in terms of boundary accuracy and the internal structure of the new buildings; it is essentially consistent with the reference ground-truth images. It is also remarkable that all methods classify some regions as changed pixels that are unchanged in the ground-truth images but changed in the bi-temporal images (Figure 5-a-b). This could be interpreted as missing changes in the manual reference images.

SYSU-CD
We train the proposed method on the SYSU-CD dataset to produce binary change maps. As shown in Figure 6, all methods succeed in detecting many changed regions. However, all miss internal structures of the changed regions because the images at T0 and T1 have high noise levels (e.g. shadow, overlap between trees and roads, or between buildings and roads, etc.). In addition, the reference ground-truth images do not identify many changed regions. The DSIFN (Figure 6-h) and the proposed method (Figure 6-i) are consistent with the reference ground-truth images in retrieving the overall structures. The NestNet2 (Figure 6-g) does not restore all structures. The remaining methods (Figure 6-c-f) identify some changed outlines of roads and buildings which are changed in Figure 6-b.
In Table 6, we compare all CD methods on the SYSU-CD dataset. All methods, excluding NestNet2, that employ difference-images or attention modules have high scores (above 70%) because the reference ground-truth images consist of irregular, broad changed regions, which are easier to distinguish. The proposed method has the highest recall score because it uses the average of the binary change maps from multiple layers and the multi-scale dice-coefficient error in the feature-difference stage to better capture detailed information. It is worth noticing that the proposed method spends the shortest time predicting the binary change map.

Onera
We train the proposed method on the Onera dataset. Figure 7 presents changed areas in Dubai city from the Onera dataset. We expect that all methods that depend on late feature fusion, such as FCSD and FCSC, focus on learning contextual object-level features such as compact buildings, continuous roads and complete boundaries. However, the FCSD (Figure 7-d) fails to retrieve some contextual shape features (e.g. outlines of roads), whereas the FCEF (Figure 7-c) identifies these features. The FCEF change map (Figure 7-c) shows broken object boundaries and poor object internal compactness because low-level features of the raw images can hardly be provided to help image reconstruction through skip-connections. Surprisingly, the FCSD (Figure 7-d) and the FCSC (Figure 7-e) also do not succeed in reconstructing large-scale to small-scale structures. The DSMSCN (Figure 7-f) and the DSIFN (Figure 7-h) retrieve uninterrupted lines and many compact buildings but miss many outlines. On the other hand, the NestNet2 (Figure 7-g) fails to recover many urban structures. The proposed method succeeds in distinguishing continuously changed lines from the entire region (e.g. roads) and shows complete composite structures (e.g. buildings). It also presents some small structures which are not shown in any of the previous maps.
In Figure 8, we show predicted change maps of small areas of the Chongqing and Las Vegas cities from the Onera dataset. These examples demonstrate that the proposed method performs well in distinguishing changes in multi-level structures, from small and middle to large details (pixel, region and object level), such as illumination variations up to whole new building structures.
Table 7 shows the precision, recall and F1-scores of the binary change maps from the Onera dataset. When we use three bands (true-color images), all methods struggle to retrieve the changed areas, and the proposed method achieves the best performance with precision, recall and F1-score equal to 50.21 ± 2.0%, 55.81 ± 1.9% and 52.12 ± 2.3%, respectively. One future direction is to optimize the proposed method to adapt it to multi-spectral images.

Figure 1 .
Figure 1. An overview of the network architecture: (a) feature-extraction and (b) difference-extraction. W, H and C are width, height and number of channels, respectively.

Figure 2 .
Figure 2. An overview of the spatial attention module (SAM). W, H and C are the width, height and number of maps from branches I and II, respectively.
First, M_conv is reshaped and transposed to produce M_i (N × C) and M_ii (C × N), where N = W × H. The reshaped and transposed matrices are multiplied, and the softmax activation function is applied, to generate M_c (C × C) (Eq. 3). M_c measures the impact of each channel of M_i on each channel of M_ii; the weaker the connection between two channels, the smaller the values of M_c. Second, M_c is multiplied with M_i to produce M_cc (W × H × C) (Eq. 4). Finally, the CAM output M_cam (W × H × C) is produced by element-wise summing the CAM input M_conv with M_cc (Eq. 5).

Figure 3 .
Figure 3.An overview of the channel attention module (CAM).W, H and C are the width, height and number of channels from both branches, respectively.
• DSMSCN (Chen et al. 2020): the encoder is divided into two Siamese networks with multi-scale feature convolution units (MFCU). The decoder network uses the difference between the convolutional layers of the two encoder networks.
• NestNet2 (Li, Li, and Fang 2020): it uses UNet++ and fully convolutional Siamese networks as encoder networks. The decoder network uses the channel attention module (CAM) to concentrate the multi-scale convolutional layers of the two encoder networks.
• DSIFN (Zhang et al. 2020): it uses two VGG16 networks as encoders. The decoder integrates the difference-maps of the convolutional feature maps into multi-scale dual attention modules (DAM).

Figure 4 .
Figure 4. Comparison between the proposed method and benchmark methods in the Szada dataset: (a) image at T 0 , (b) image at T 1 , (c) ground-truth image (changed and unchanged pixels are depicted in white and black, respectively), (d) FCEF, (e) FCSD, (f) FCSC, (g) DSMSCN, (h) NestNet2, (i) DSIFN and (j) the proposed method (true positives are depicted in yellow, missed changes in red and false positives in green).

Figure 5 .
Figure 5. Comparison between the proposed method and benchmark methods in the Tisza dataset: (a) image at T0, (b) image at T1, (c) ground-truth image (changed and unchanged pixels are depicted in white and black, respectively), (d) FCEF, (e) FCSD, (f) FCSC, (g) DSMSCN, (h) NestNet2, (i) DSIFN and (j) the proposed method (true positives are depicted in yellow, missed changes in red and false positives in green).

Figure 6 .
Figure 6. Comparison between the proposed method and benchmark methods in the SYSU-CD dataset: (a) image at T0, (b) image at T1, (c) ground-truth image (changed and unchanged pixels are depicted in white and black, respectively), (d) FCEF, (e) FCSD, (f) FCSC, (g) DSMSCN, (h) NestNet2, (i) DSIFN and (j) the proposed method (true positives are depicted in yellow, missed changes in red and false positives in green).

Figure 7 .
Figure 7. Comparison between the proposed method and benchmark methods in the Onera dataset: (a) image at T0, (b) image at T1, (c) ground-truth image (changed and unchanged pixels are depicted in white and black, respectively), (d) FCEF, (e) FCSD, (f) FCSC, (g) DSMSCN, (h) NestNet2, (i) DSIFN and (j) the proposed method (true positives are depicted in yellow, missed changes in red and false positives in green).

Figure 8 .
Figure 8. Building detection in the Chongqing and Las Vegas cities: (a) and (e) input image at T0, (b) and (f) input image at T1, (c) and (g) ground-truth images (changed and unchanged pixels are depicted in white and black, respectively), and (d) and (h) change maps (true positives are depicted in yellow, missed changes in red and false positives in green).
In this paper, we use the Szada dataset to train the convolutional model. It comprises 42 pairs of optical aerial images acquired in different years under different seasonal conditions. The number of image pairs is relatively small, so we divide the images into 20,000 patches of size 256 × 256. We use 12,000, 3000 and 5000 pairs as training, validation and testing images, respectively. We use the proposed model and retrain it on the Tisza dataset (24 pairs). This dataset can be downloaded from http://web.eee.sztaki.hu/remotesensing/airchange_benchmark.html. The image pairs consist of red, green and blue bands with dimensions 952 × 640 and a spatial resolution of 1.5 meters per pixel. Histogram matching is applied to the two co-registered images to achieve color consistency. The annotated changes focus on changes in agricultural areas (new built-up regions, fresh plough-land and groundwork before building).

Table 1 .
Dual attention module map M_dam and the mean of DAM maps M̄(M_dam), element-wise multiplied with the difference-image M_D(T0,T1), vs. loss functions: binary cross-entropy error E_C(Y, Ŷ), single-scale dice error E_D(Y, Ŷ) and multi-scale dice error E_D,L(Y, Ŷ). The highest score is shown in blue.

Table 2 .
Dual attention module map M_dam and the mean of DAM maps M̄(M_dam) vs. loss functions: binary cross-entropy error E_C(Y, Ŷ), single-scale dice error E_D(Y, Ŷ) and multi-scale dice error E_D,L(Y, Ŷ). The highest score is shown in blue.

Table 3 .
Channel attention module map M cam

Table 4 .
Comparison between the previous methods and the proposed method based on quantitative metrics in the Szada dataset.The best score and the worst score are presented in blue and red colors, respectively.

Table 6 .
Comparison between the previous methods and the proposed method based on quantitative metrics in the SYSU-CD dataset. The best score and the worst score are presented in blue and red colors, respectively.