Building change detection using the parallel spatial-channel attention block and edge-guided deep network

Building change detection in high-resolution satellite images plays a special role in urban management and development. Recently, methods for building change detection have been greatly improved by the development of deep learning. Although deep learning technologies, especially Siamese convolutional neural networks, have been successful and popular


Introduction
Building change detection refers to the detection of spatial changes in buildings within a defined geographic area using multi-temporal remote sensing imagery (Qin et al., 2016; Sun et al., 2020). The identification of building changes has various applications in urban and rural planning and development (Yang et al., 2021), crisis management (Rastiveis et al., 2013), and updating topographic and urban cadastre maps (Ayazli et al., 2019; Duarte et al., 2018). In recent years, automated technologies, particularly machine learning approaches for change detection, have accelerated the development of remote sensing applications (Khelifi and Mignotte, 2020; Shirowzhan et al., 2019; Tewkesbury et al., 2015). Over the last decade, a new generation of very-high-resolution (VHR) satellites has been launched, capable of acquiring images with 1-m or better resolution (Wen et al., 2021). The spatial qualities and shapes of features are highly defined in VHR images, which is critical for analyzing and identifying changes in man-made features such as buildings (Bruzzone and Bovolo, 2013). Improving change detection accuracy by extracting and learning deep, practical feature information, especially around building edges, and reducing pseudo-change detection are critical topics in remote sensing building change detection.
The two most widely used scenarios for identifying changes are pixel-based and object-based (Hussain et al., 2013). Pixel-based approaches directly compare each pixel's spectral information or texture and produce a change map via a threshold or classification. While this method is straightforward to execute, it does not incorporate spatial context information, and the resulting maps exhibit significant salt-and-pepper noise, particularly when high-resolution (HR) and VHR images are used (Cao et al., 2014; Volpi et al., 2012). As a result, it is better suited to detecting changes in medium-resolution images. Object-based approaches begin by extracting image objects and then analyze changes in HR and VHR multi-temporal images using rich spectral, textural, structural, and geometric information. Although these approaches take advantage of spatial context, manual feature extraction methods are rather complex and lack robustness (Gil-Yepes et al., 2016). Because most conventional change detection methods entail manually extracting features from the image, picking suitable features is sometimes tricky and undermines the methods. The use of deep learning (DL) algorithms to intelligently extract features has been studied in recent years (Ball et al., 2017). Given the scarcity of innovation and research in this field, the growth and development of DL architectures are critical for remote sensing and image processing (Zhang et al., 2016). As a result, DL techniques and the extraction of various deep spectral, spatial, and temporal features have been widely applied in research analyzing a variety of HR, VHR, SAR, and LiDAR images (Li et al., 2021; Parikh et al., 2020; Pritt and Chern, 2018). CNNs have been widely used to extract rich high-level features and can optimize the number of parameters used to learn non-linear representations (Mou et al., 2019; Wang et al., 2020; Xu et al., 2017). VGGNet (Wang et al., 2021), AlexNet (Yuan and Zhang, 2016), SegNet (Badrinarayanan et al., 2017), U-Net (McGlinchy et al., 2019), and many other networks based on the CNN framework have been developed by researchers. Thus far, DL results in remote sensing have been excellent, and research is still in progress (Ma et al., 2019).
Change detection methods based on DL are classified into classification-based and metric-based methodologies (Khelifi and Mignotte, 2020). Classification-based methods combine the first- and second-time images into a joint vector, and deep features are extracted from the generated bi-temporal images. In such methods, the bi-temporal images are entered into a single-branch network, and the changes are then categorized according to their change score (Chen and Shi, 2020). For instance, (Peng et al., 2019) developed an end-to-end improved UNet++ to extract global and fine-grained features and fuse them to generate a high-accuracy final change map. In metric-based methods, changes are obtained by comparing the distance between the two images. Each image is usually entered into a separate network, making the network dual-branch. The embedding space is trained so that similar embedding vectors come closer and dissimilar embedding vectors move away from each other (Zhan et al., 2017). This method evaluates image pair details more effectively than classification-based methods (Xue et al., 2021). The embedding space can be formed with Siamese networks (Zhan et al., 2017), which include two networks with shared weights, or with two separate networks with independent weights (Song et al., 2021). (Zhang and Lu, 2019) used a spectral-spatial joint learning network (SSJLN) and a Siamese CNN for multispectral imagery change detection. (Xue et al., 2021) proposed a building change detection network that extracted multi-feature enhancement layers from information from the central and accessory branches. In the change detection problem, the focus has been on increasing regional accuracy. With the improvement of the spatial resolution of satellite images, the increased variety of objects and the complexity of object boundaries have created significant challenges for accurate segmentation in change detection (Shafique et al., 2022). A densely connected Siamese network was introduced by (Fang et al., 2021) to transmit compressed information between network layers to reduce the loss of localization information in deep layers.
Specifically, in building change detection, the high density of buildings, the partial hiding of buildings by shadow occlusion, building roofs with different materials, and a significant similarity between building roofs and other objects such as roads cause blurring of the edges. As a result, accuracy decreases in the building change identification problem (Zhou et al., 2022). Using convolutions with more complex layers, dense integration of features, and attention are among the latest methods to increase building change detection accuracy. (Zheng et al., 2021) introduced a new cross-layer deep network that extracts multi-scale and multi-level contextual information in two parallel branches to increase building change detection accuracy. Attention mechanisms have been used to acquire long-range discriminative features to improve building change detection accuracy. For example, (Ding et al., 2021) applied spatial attention by adding cross-layer and skip connections to extract contextual features, and multi-scale features were extracted by atrous spatial pyramid pooling (ASPP). In addition to spatial attention, channel, temporal, and self-attention, or their combinations as multi-attention, have been used to extract long-range discriminative properties in change identification (Chen et al., 2021; Ding et al., 2021; Ma et al., 2022). Research has shown that using spatial and channel attention together extracts the distinguishing, contextual, and long-range features that increase detection accuracy (Guo et al., 2022). In most cases, a sum fusion of attentions is used to detect building changes (Chen et al., 2021), while some research in other image processing fields has shown that other types of combination, such as parallel combination, can provide better results (Yin et al., 2020). In addition to the inherent problems of building change detection, such as the high density of buildings and the other cases mentioned, the use of DL networks causes the loss of part of the spatial information in the down-sampling process and may create inaccurate edges (Weng and Zhu, 2021), which has received little attention. Using multi-level features such as residual learning (Diakogiannis et al., 2020), dense connection fusion, and skip connections (Fang et al., 2021) can capture boundary information and improve the edges to some extent. In some research on building segmentation, edges have been used as prior information, making deep CNNs learn more boundary features, with successful results (Jung et al., 2021). Likewise, some multi-task research that, besides extracting buildings, additionally performs shape-related tasks, such as learning boundaries or distances and accounting for boundary consistency constraints in the loss function, improves segmentation (Shi and Zhang, 2021).
Considering the challenges mentioned for building change detection in complex scenes with high building density, shadow occlusion, shape and material similarities between buildings and other objects, registration errors, and seasonal changes that occur in remote sensing time series images, obtaining correct building changes with accurate and complete boundaries has remained a challenging issue. To address this problem, a new methodology based on a dual-branch DL network and a parallel spatial-channel attention mechanism is proposed. In this method, the parallel attention block is used to extract spatially discriminative features for more robust change detection. Furthermore, to recover the shape and details of changed buildings at the edges smoothed in the down-sampling process, a new dual loss function is designed that considers a consistency constraint between the changed building masks and edges, as well as the imbalance between changed and unchanged samples in the data. The proposed method was implemented on the building change detection dataset (BCDD) (Ji et al., 2019) and LEVIR-CD (Chen and Shi, 2020), and it was compared with several SOTA methods.

Methodology
Given the need to extract spatial and spectral discriminative features and to solve the edge loss problem in down-sampling deep networks, this paper proposes a parallel spatial-channel attention dual-branch network and a new loss function. Fig. 1 displays the general concept of the proposed network. The suggested method is based on a dual-branch deep metric learning network designed to extract deep features from two multi-temporal images with different exposure conditions. In addition, at the end of each side of the network, a parallel spatial-channel attention block is used to extract the distinguishing features. A new loss function that preserves the shape at the building edges is introduced. The network details, including the dual-branch network, the spatial-channel attention mechanism, the parallel method for attention combination, and the edge consistency constraint-weighted binary cross entropy (ECC-WBCE) dual loss function, are explained below.

Network overview
Because of the time differences that usually occur between datasets used for change detection, there are usually large radiometric and even geometric differences between the two images. As such, the two datasets are different; even if they were taken from the same sensor, they look as if they were obtained from two different sources. Consequently, it is more efficient to use two separate networks, and for these reasons a dual-branch network was designed and implemented. The dual-branch network has two input sides and one expansive path in the middle. Each contraction side has four levels of encoding, each including convolution, batch normalization, and dropout. The number of decoding layers likewise depends on the number of convolution layers and includes batch normalization. At the end of each side of the network, where the features are extracted from the two input images, the dual attention unit is calculated to intensify the discriminative features and extract components independent of shadow, angle, noise, and context. Using the dual attention mechanism, in addition to extracting more features from HR images containing additional information, connections are created between spatial information, and general context information is extracted to better separate changed and unchanged areas. After that, the Euclidean distance between the features is computed, and deconvolution is performed. Low-level features are copied symmetrically from both sides of the network in the expansion path and are combined and stacked with high-level information. Then, deconvolution is carried out, with the network weights on both sides remaining independent. The network backpropagation path transmits the loss values produced by the loss function to either side of the network, concurrently updating the network weights on both sides. In this way, nonlinear modeling of multi-source data, or data with different conditions, is performed simultaneously.
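As a minimal illustration of the distance step described above (comparing the two branches' deep features by Euclidean distance), a numpy sketch might look as follows; the array shapes and the function name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def feature_distance(f1, f2):
    """Per-position Euclidean distance between two C x H x W feature maps.

    Collapses the channel axis, leaving an H x W distance map that the
    expansion path can turn into a change map.
    """
    return np.sqrt(np.sum((f1 - f2) ** 2, axis=0))

# Toy features from the two branches: 4 channels over a 2 x 2 grid.
f1 = np.zeros((4, 2, 2))
f2 = np.ones((4, 2, 2))
dist = feature_distance(f1, f2)  # every entry is sqrt(4) = 2.0
```

Positions where the two branches' embeddings agree produce small distances (unchanged), while disagreement produces large distances (changed), which is the metric-learning behavior the network is trained toward.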

Spatial-channel attention mechanism

Spatial attention mechanism
The spatial attention mechanism (SAM) is an adaptive selection of spatial regions that determines where attention should be paid (Guo et al., 2022). It is therefore a method for extracting discriminative features that is also very effective in distinguishing changed from unchanged pixels. SAM is used to measure the rich context of local features. The SAM, a kind of self-attention mechanism, encodes a long range of local features as background information, thus increasing the quality of the represented features. Alternatively referred to as intra-attention, self-attention is a process for calculating a sequence representation by focusing attention on different positions within the sequence (Vaswani et al., 2017). Self-attention techniques have proved successful in modeling long-range dependencies. Key, value, and query are the three primary parameters in defining the self-attention mechanism, and the query is mapped to a key-value pair at the output (Vaswani et al., 2017). The relationship between these three parameters is given in Equation (1):

Attention(q, K, V) = Σ_i α_i(q, k_i) v_i  (1)
where k_i and v_i are the key and value vectors from the source and q is the query vector. α_i(q, k_i) is the similarity function between the query and the corresponding key, computed using the softmax function. In this article, the term key (or query, value) refers to a vector at a particular position in a key (or query, value) tensor, where the key, query, and value tensors are generated separately by three different convolutional layers.
According to Fig. 2, the spatial attention input is a feature F with three dimensions C × H × W, where C is the number of bands, H the height, and W the width. Applying three convolutions with the same structure creates three new features, V, K, and Q, with the same dimensions C × H × W; these are the value, key, and query tensors. To obtain the similarity matrix, V, K, and Q are reshaped to C × N, where N = H × W, and the similarity between the ith key and the jth query is computed as

Fs_ji = softmax(K_i · Q_j / √C)  (2)

Fs_ji measures the influence of the feature at position i on the feature at position j; note that √C in the denominator normalizes Fs_ji and reduces the impact of a large C (Chen and Shi, 2020). The stronger the connection between the two positions, the greater Fs_ji. The output is then obtained by matrix multiplication of V (of size C × N) with Fs and reshaping the result back to C × H × W. In the end, a scale parameter γ multiplies this result, which is added element-wise to F:

Fsa = γ (V Fs) + F  (3)
Parameter γ takes the initial value of 0, and its value increases during training. The equation above shows that Fsa at any position is the weighted sum of the features at all positions plus the original feature. The spatial attention process thus collects context selectively across the whole image, increasing the similarity of semantic properties and improving the semantic compactness and consistency of each class. Consequently, the network can distinguish changes from pseudo-changes more precisely.
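The position-attention computation described above can be sketched in plain numpy as follows; this is a minimal illustration, with random projection matrices standing in for the learned convolutions that produce Q, K, and V, and all shapes and names are assumptions for the sketch rather than the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(F, gamma=0.1, seed=0):
    """Position attention over all H*W locations (sketch of Eqs. (2)-(3))."""
    C, H, W = F.shape
    N = H * W
    rng = np.random.default_rng(seed)
    # Stand-ins for the three convolutions that produce Q, K and V.
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
    flat = F.reshape(C, N)
    Q, K, V = Wq @ flat, Wk @ flat, Wv @ flat        # each C x N
    # Similarity of every key position i with every query position j,
    # scaled by sqrt(C) to keep the softmax well-behaved.
    Fs = softmax(K.T @ Q / np.sqrt(C), axis=0)       # N x N
    out = V @ Fs                                     # C x N weighted sum
    return gamma * out.reshape(C, H, W) + F          # residual scaled by gamma

F = np.random.default_rng(1).standard_normal((8, 4, 4))
Fsa = spatial_attention(F)  # same shape as F
```

With γ = 0 the block reduces to the identity, which matches the paper's choice of initializing γ at 0 so attention is blended in gradually during training.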

Channel attention mechanism
The channel attention mechanism (CAM) is a mechanism for channel-based attention in convolutional neural networks (Guo et al., 2022). In deep feature maps, an object is usually represented by different channels. Channel attention adjusts the weight of each channel and can be interpreted as deciding what should be paid attention to when selecting the object (Hu et al., 2020). To create a channel attention map, the relationships between channels and their features are used. By considering the relationship between channel maps, interdependent features become more pronounced, the representation of meaningful features improves, and change detection accuracy increases. We therefore used CAM to model the relationships between the channels. As shown in Fig. 3, we did not use convolutions to create new features. Instead, the input feature F of size C × H × W is reshaped to C × N, where N = H × W. Then F is multiplied by its transpose to obtain a channel attention matrix of size C × C, after which a softmax is applied as in Equation (4):

Fx_ji = exp(F_i · F_j) / Σ_i exp(F_i · F_j)  (4)
where Fx_ji shows how the jth channel affects the ith channel; a larger value indicates stronger connectivity between the two channels. The reshaped F (C × N) is then multiplied by Fx to obtain the reweighted channels. Finally, the result, scaled by a coefficient δ, is added to F to obtain the final output as Equation (5):

Fca = δ (Fx F) + F  (5)
where δ starts at 0, and its actual value is learned during training. A channel's final feature is thus the weighted sum of all channels plus the feature itself, as derived from the equation above. Finally, the spatial and channel features are concatenated with each other; they are calculated in the same way for each side of the network. Subsequently, the Euclidean distance between the two sides' features is computed, and the result enters the expansion section of the network.
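The channel-attention computation above can be sketched in a few lines of numpy; note that, unlike spatial attention, no extra projections are used and the affinity matrix comes directly from the input feature. Shapes and names here are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(F, delta=0.1):
    """Channel attention sketch (Eqs. (4)-(5)): the C x C affinity matrix
    is built straight from the input feature, without convolutions."""
    C, H, W = F.shape
    flat = F.reshape(C, H * W)               # C x N, N = H * W
    Fx = softmax(flat @ flat.T, axis=-1)     # C x C channel affinity
    out = Fx @ flat                          # channels reweighted by affinity
    return delta * out.reshape(C, H, W) + F  # residual scaled by delta

F = np.random.default_rng(2).standard_normal((4, 3, 3))
Fca = channel_attention(F)  # same shape as F
```

As with γ in the spatial branch, δ = 0 makes the block an identity, so the affinity-based reweighting is introduced gradually as δ is learned.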

Parallel spatial-channel attention block
As previously discussed, spatial attention creates features focusing on the locations where changes have occurred, and channel attention generates features highlighting the informative channels. Combining the two attention mechanisms is therefore beneficial for accurately identifying changes. A straightforward approach is to combine the spatial and channel attention modules in series, as used in (Sun et al., 2019) for salient object detection. However, with this arrangement, over-emphasized features can cause some details to be lost. Therefore, a parallel combination of spatial-channel attention blocks is introduced in this paper (see Fig. 4).
Three parallel branches process the last feature computed by the network. The first branch is the original feature itself; the second branch consists of the spatial attention module; and the third branch contains the channel attention module. These three parallel branches are then mixed by a residual learning strategy. The residual learning method is an effective alternative to concatenation fusion and produces more efficient attention features. In addition, informative features are learned more robustly using residual learning in a parallel attention mechanism.
If X is the input feature from the previous step and Fca is the channel attention output, the residual learning that merges X and Fca is determined as follows:

Z = O_conv(X + Fca)  (6)

where O_conv indicates a 3 × 3 convolution operator, and Z is the output of this step and the input of the next fusion step. Then the following operation is performed as Equation (7):

R = O_conv(Z + Fsa) + X  (7)

where Fsa is the spatial attention output, and R is the final output of the parallel spatial-channel attention fusion. Note that after combining the Z and Fsa features in the residual learning, the input feature X is added to the final output. With this fusion method, more detailed information can be extracted from the hybrid dual attention features.
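The two-step residual fusion described above (merge the input with the channel-attention branch, convolve, merge with the spatial-attention branch, convolve, then add the input back) can be sketched as follows. The identity default for `conv` is a deliberate placeholder for the 3 × 3 convolution O_conv, so this is a structural sketch rather than the trained operator:

```python
import numpy as np

def parallel_fusion(X, Fca, Fsa, conv=lambda t: t):
    """Residual fusion of the three parallel branches.

    `conv` stands in for the 3x3 convolution O_conv; identity here
    so the data flow is easy to follow.
    """
    Z = conv(X + Fca)       # merge the input with the channel-attention branch
    R = conv(Z + Fsa) + X   # merge with the spatial-attention branch, re-add X
    return R

X = np.ones((2, 2))
R = parallel_fusion(X, np.full((2, 2), 2.0), np.full((2, 2), 3.0))
# With identity conv: R = (X + 2) + 3 + X, i.e. every entry is 7.0
```

The key design point visible in the sketch is the skip path: X contributes both inside the fused term and as the final residual addition, which is what lets the block preserve detail that pure attention outputs might over-suppress.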

ECC-WBCE loss function
ECC: Since shape details are lost in the expansion process of U-shaped networks, conditions and methods must be used to preserve the shape as much as possible. Edges encode important shape information and can be used to recover it; here, we used the edges of the change mask. Thus, an appropriate loss function is defined by extracting the edges of the changed and unchanged regions and imposing a consistency constraint between the edge and the change mask. Therefore, to lessen the difference between the ground truth change mask and the predicted change mask, the difference between the edges extracted from the two masks must also be minimized in the network learning process. To model the consistency constraint between the change edge and the change mask, the change mask is converted into the change edge using a transform function. The transform function calculates the maximum difference between the predicted change mask and its neighboring pixels, so that change-edge pixels are assigned the value 1 and all other pixels 0 in the transformed boundary map. The transform function is obtained from Equation (8):

E(x) = |x − minpool(x)|  (8)

where x denotes the probability values of changes in the predicted change area, and minpool is a minimal pooling operation whose kernel size and stride were set to three and one, respectively, to ensure that the output map has the same size as the predicted change mask. This function is also applied during the training process.
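Under the reading that the transform compares each pixel with the minimum of its 3 × 3 neighborhood, a numpy sketch of the mask-to-edge conversion might look like this; the edge-replication padding is an assumption made so the output keeps the input size:

```python
import numpy as np

def min_pool3(x):
    """3x3 min pooling with stride 1; edge padding keeps the spatial size."""
    p = np.pad(x, 1, mode="edge")
    windows = [p[i:i + x.shape[0], j:j + x.shape[1]]
               for i in range(3) for j in range(3)]
    return np.min(windows, axis=0)

def mask_to_edge(x):
    """Largest drop to any 3x3 neighbour marks an edge pixel (sketch of Eq. (8))."""
    return np.abs(x - min_pool3(x))

# A 3x3 block of "changed" pixels inside a 6x6 mask.
mask = np.zeros((6, 6))
mask[2:5, 2:5] = 1.0
edge = mask_to_edge(mask)
# Only the boundary ring of the block fires: 8 pixels at value 1,
# the interior pixel and the background stay at 0.
```

Because min-pooling only differs from the input where a neighborhood straddles the mask boundary, interior pixels and background pixels both map to 0, leaving exactly the one-pixel-wide boundary, which is the behavior the consistency constraint relies on.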
To apply the consistency constraint between the mask and the change edge, the ECC loss is defined to minimize the difference between them:

L_ECC = |E_pre − E(M_gt)|  (9)

where | · | is the L1 distance function, E(M_gt) is the edge map obtained by applying the transform of Equation (8) to the ground truth mask, M_pre represents the mask of predicted changes, E_pre indicates the edge of predicted changes, and M_gt denotes the mask of ground truth changes. Instead of L1, other functions such as mean squared error (MSE) or binary cross entropy (BCE) can be used; an ablation study on this issue in Part 4.3 showed that L1 provides better results. WBCE: Changed and unchanged samples are unbalanced in building change detection training data, with the unchanged samples outnumbering the changed ones many times over. Simple loss functions do not take this imbalance into account. We therefore used the WBCE loss function to address the imbalance problem and its effect on final accuracy, as in Equation (10), with weights determined in proportion to the number of training samples:

L_WBCE = −Σ [w1 · M_gt · log(M_pre) + w2 · (1 − M_gt) · log(1 − M_pre)]  (10)
where M_pre indicates the predicted changed building mask and M_gt denotes the ground truth changed building mask. w1 and w2 are the weights of the changed and unchanged feature pairs, calculated from Equations (11) and (12) as follows:

w1 = N_U / (N_C + N_U)  (11)

w2 = N_C / (N_C + N_U)  (12)

where N_C and N_U are the numbers of changed and unchanged pixel pairs, respectively.
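A minimal numpy sketch of the weighted BCE, assuming the common weighting in which each class weight is the opposite class's pixel fraction (so the rarer changed class is up-weighted); array contents and the mean reduction are illustrative assumptions:

```python
import numpy as np

def wbce_loss(m_pre, m_gt, eps=1e-7):
    """Weighted binary cross entropy sketch (Eqs. (10)-(12) reading).

    Each class is weighted by the opposite class's pixel fraction, so
    mistakes on the scarce changed pixels cost more than on background.
    """
    n_c = m_gt.sum()               # number of changed pixels
    n_u = m_gt.size - n_c          # number of unchanged pixels
    w1 = n_u / (n_c + n_u)         # weight of the changed class
    w2 = n_c / (n_c + n_u)         # weight of the unchanged class
    m_pre = np.clip(m_pre, eps, 1 - eps)  # avoid log(0)
    per_pixel = -(w1 * m_gt * np.log(m_pre)
                  + w2 * (1 - m_gt) * np.log(1 - m_pre))
    return per_pixel.mean()

m_gt = np.array([1.0, 0.0, 0.0, 0.0])            # one changed pixel in four
good = wbce_loss(np.array([0.9, 0.1, 0.1, 0.1]), m_gt)
bad = wbce_loss(np.array([0.1, 0.9, 0.9, 0.9]), m_gt)
```

With this weighting, a prediction that misses the single changed pixel is penalized far more heavily than its 1-in-4 pixel share would suggest, which is exactly the imbalance correction the section motivates.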

Final loss:
The final loss function is obtained as the weighted sum of the ECC loss introduced in Equation (9) and the WBCE loss introduced in Equation (10). Thus, the final loss function (L_f) is obtained from the following equation:

L_f = λ1 · L_WBCE + λ2 · L_ECC  (13)

where λ1 and λ2 are weight coefficients for combining the two losses, determined experimentally.

Accuracy assessment
Precision-recall is an effective set of predictive metrics when the classes are highly imbalanced (Chicco and Jurman, 2020). Since the changed pixels are usually much fewer than the unchanged ones in the change detection problem, and there is no balance between these two classes, precision-recall metrics are appropriate. Therefore, precision (Pr), recall (Re), and F1-score (F1) are used to evaluate the results (Tharwat, 2018):

Pr = TP / (TP + FP)

Re = TP / (TP + FN)

F1 = 2 · Pr · Re / (Pr + Re)
In building change detection, TP is a true positive denoting correctly detected changed pixels, FP is a false positive denoting pixels falsely detected as changed, and FN is a false negative representing changed pixels falsely determined as unchanged. For the change detection case, the fewer false detections, the higher the precision; the fewer missed detections, the higher the recall. F1 is the overall criterion for evaluating the results, with higher values indicating more desirable fitting results.
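These standard definitions translate directly into code; a small numpy sketch over binary change maps (the toy arrays are illustrative):

```python
import numpy as np

def pr_re_f1(pred, gt):
    """Precision, recall and F1 from binary change maps (1 = changed)."""
    tp = np.sum((pred == 1) & (gt == 1))   # correctly detected changes
    fp = np.sum((pred == 1) & (gt == 0))   # false alarms
    fn = np.sum((pred == 0) & (gt == 1))   # missed changes
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    f1 = 2 * pr * re / (pr + re)
    return pr, re, f1

pred = np.array([1, 1, 0, 0])
gt = np.array([1, 0, 1, 0])
pr, re, f1 = pr_re_f1(pred, gt)  # one TP, one FP, one FN -> all 0.5
```

Note that true negatives never enter any of the three formulas, which is why these metrics stay informative even when unchanged pixels vastly outnumber changed ones.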

Dataset
A) BCDD Dataset: BCDD is a dataset for building change detection covering an area in New Zealand where an earthquake with a magnitude of 6.3 occurred in February 2011. Aerial images were acquired in 2012, after the earthquake, and again in 2016, after reconstruction. The two 100 % overlapping datasets, with a 20-cm spatial resolution, contain 12,796 buildings in 20.5 km² in the 2012 data and 16,077 buildings in the same area in the 2016 data. A number of images from this database are shown in

Implementation details
The TensorFlow backend is used to implement the proposed method, running on a single NVIDIA GTX 1070Ti GPU with 8 GB of memory. The Adam optimizer was used with learning rates decaying from 1e-1 to 1e-4, and the training data entered the network with a batch size of 20. The λ1 value was set to 1, and the value of λ2 was set to 0.5 based on the performance experiments. In both the BCDD and LEVIR-CD datasets, 70 % of the data was used for training, 20 % for testing, and 10 % for validation. Due to the graphics card's limitations and according to the model presented in Fig. 1, we chose an input size of 128 × 128 pixels. Therefore, the numbers of training, testing, and validation samples were 21,070, 6,020, and 3,010 for BCDD and 28,480, 8,192, and 4,096 for LEVIR-CD, respectively. All training data were augmented by random horizontal and vertical flipping and random rotation (−25° to 25°).
Finally, the training times were set to 100 epochs until the convergence of the model.

Experimental results
Several experiments are designed to show the effects of using a dual-branch network, the spatial-channel attention mechanism in the two fusion modes (concatenation and parallel), and the ECC-WBCE loss function. A single-branch U-Net (ResNet50) (Diakogiannis et al., 2020) and the proposed dual-branch model are selected as the baseline networks. U-Net (ResNet50) denotes a single-branch U-Net in which the first- and second-time images are stacked together and enter the network simultaneously; in other words, the input of this network is a six-band image combining the first and second times. The effect of the spatial-channel attention mechanism, its various fusion modes, and the ECC-WBCE loss function is also investigated on the BCDD and LEVIR-CD datasets.
Fig. 10. Visual comparison between state-of-the-art methods and the proposed method on the BCDD dataset.

Results of BCDD
The proposed method was also implemented on the BCDD data step by step, with the results given in Table 1. In Table 1, U-Net (ResNet50) is a single-branch network with six input bands; "U-Net (ResNet50) + concatenate spatial-channel attention" means using spatial-channel attention at the end of the U-Net (ResNet50) encoding stage with concatenation fusion of the two attentions; and "U-Net (ResNet50) + parallel spatial-channel attention" means using spatial-channel attention at the end of the encoding stage with the proposed parallel fusion mechanism.
Using the proposed dual-branch network without any attention mechanism increased F1 by an average of 5.66 %. The spatial-channel attention mechanism with concatenation fusion also increased F1 by an average of 5.35 % for binary cross-entropy loss and 4.86 % for the ECC-WBCE loss function over the two baselines: the single-branch U-Net (ResNet50) and the proposed dual-branch network. The parallel spatial-channel attention fusion method further increased network performance, with F1 rising by an average of 1.12 % compared to the concatenated spatial-channel fusion method. The visual results of the step-by-step implementation of the proposed model on the BCDD data are shown in Fig. 8. Since the spatial resolution of the BCDD images is 20 cm, the model implementation results can be seen in more detail. It is also clear from Fig. 8 that the proposed dual-branch network (columns d and h) makes the detected changes more complete and greatly reduces the FN error rate. The parallel spatial-channel attention mechanism increases the extraction of details and, in addition to reducing FN, raises TP. Finally, in the second and fourth rows of Fig. 8, the ability of the proposed algorithm and the ECC-WBCE dual loss to extract the edge details in column k is well observed.

Results of LEVIR-CD
We have shown the effects of using a dual-branch network, concatenated and parallel spatial-channel attention, and the proposed dual loss function. Precision, recall, and F1-score have been used to evaluate the results. The result of our proposed method on LEVIR-CD is shown in Table 2, and visual results for some examples are given in Fig. 9. The first point in Table 2 is the effect of using the proposed dual-branch network, which increases F1 by 6.67 % with binary cross-entropy loss and by 8.67 % with the dual loss function. Comparing the results of Fig. 9 in columns (d, e) and (h, i) shows that the proposed dual-branch network performs well: the prediction error is much lower, and the changes are much more complete. Using the spatial-channel attention mechanism in the two fusion modes raised precision, recall, and F1-score on both the U-Net (ResNet50) and dual-branch models. The F1 value when using the proposed method with the parallel spatial-channel attention mechanism and the ECC-WBCE loss function rose by about 14 % compared to the U-Net (ResNet50) baseline network, a significant improvement that indicates the proposed method's performance. Checking the results in Fig. 9 in columns (f, g) and (j, k) shows that the parallel spatial-channel attention mechanism improved the results and extracted the changes in more detail. Ultimately, we investigated the effect of the dual loss function based on ECC-WBCE. The proposed loss function increased the F1-score by an average of 2.07 across all six cases in Table 2. In particular, the new loss function further increased recall, indicating a decrease in FNs due to this function; furthermore, the increase in precision indicates a suitable decrease in FPs due to the dual loss function. Columns e, g, i, and k in Fig. 9 present the dual loss function's effect. Due to the boundary proximity condition in the loss function definition, the changes at the edges have been extracted with better accuracy, and the network has tried to extract the changes completely and consistently.

Discussion
To better understand the proposed method's results and performance, we implemented other state-of-the-art CD methods on both datasets and compared them with the proposed method statistically and visually. An efficiency analysis was performed for the state-of-the-art CD methods on LEVIR-CD data to demonstrate how the proposed method works. Finally, a comparative study was done to show the effect of using different types of ECC loss operators on the final performance of the proposed method.

Comparisons with state-of-the-art methods
Several state-of-the-art methods were implemented to appraise the performance of our method. Peng et al. (2019) proposed U-Net++ for high-resolution satellite image change detection using an effective encoder-decoder architecture and fusion of multiple side outputs. STANet (Chen and Shi, 2020) was proposed by the creators of the LEVIR-CD dataset; it models the spatial-temporal dependency using a self-attention mechanism. DDCNN (Peng et al., 2020) applies dense attention and several up-sampling attention units to bi-temporal images, and its DE unit further improves the efficiency of the network. AGCDetNet (Song and Jiang, 2021) is an attention-guided change detection network in which multilevel features and multiscale context are enhanced by spatial attention and a channel-wise attention-guided interference filtering unit. Another comparative method is SNUNet (Fang et al., 2021), a combination of a densely connected Siamese network and NestedUNet. This network transmits compressed information between network layers to reduce the loss of localization information in deep layers, and an ensemble channel attention module improves the representation of features in different layers. In implementing SNUNet, we set the number of feature-map channels to 48 to achieve better performance. Of the comparison methods mentioned so far, the first two are U-shape-based variants and the next four are based on attention mechanisms. In addition, EGRCNN (Bai et al., 2022) extracts discriminative information with a difference-analysis module and uses prior edge information to improve the accuracy of building change detection and produce highly accurate structural edges; an edge guidance module is designed to aggregate prior boundary information. Finally, EGCTNet (Xia et al., 2022) combines a CNN and a transformer. In this method, the global noise of the encoding stage is eliminated and building edges are used to guide the features; an edge detection branch and an edge fusion module are designed to incorporate edge features.
For a fair comparison, all methods were implemented with the hyperparameters listed in Table 3. Due to graphics-card limitations, we set the dimensions of the input data to 128 × 128. Because the input samples are smaller, larger batch sizes are also possible (we used 20 here). These methods were implemented on the BCDD and LEVIR-CD datasets.
The evaluation results of the proposed method against the state-of-the-art techniques are listed in Table 4 for the BCDD dataset and Table 5 for the LEVIR-CD dataset. According to the comparison, our method outperforms the previous change detection methods: the F1 value grew by 2.43 % and 1.83 % on the BCDD and LEVIR-CD datasets, respectively. Notable is the excellent precision on the BCDD dataset, which grew by 3.35 % compared to the previous best methods. A high Pr indicates the extraction of local information through dense connections between multiscale features. The Re parameter also increased on both datasets, by 1.54 % and 1.03 %, respectively. 1) In areas with high building density, recognizing the correct shape of building changes is very challenging; in these parts, the edges are usually not extracted correctly. The results show that in the BCDD data (samples in columns 1 and 3) and in the LEVIR data (samples in columns 2 to 4), where the building density is higher, the changes are more complete and less boundary noise is detected. The proposed method described boundary information well for small and dense buildings, mainly seen in the LEVIR data samples, where the ability of models to separate these buildings individually is a significant challenge.
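For reference, the precision, recall, and F1-score reported in Tables 4 and 5 reduce to simple counts over the binary change maps. A minimal sketch (the function name is our own):

```python
import numpy as np

def change_metrics(pred, label):
    """Precision, recall, and F1 for a binary change map (1 = changed)."""
    pred = pred.astype(bool)
    label = label.astype(bool)
    tp = np.logical_and(pred, label).sum()    # changed, detected
    fp = np.logical_and(pred, ~label).sum()   # unchanged, flagged as changed
    fn = np.logical_and(~pred, label).sum()   # changed, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

High precision thus reflects few false-positive pixels, while high recall reflects few missed change pixels; F1 balances the two.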
As shown in Fig. 11, some methods were not able to extract each building separately due to the shadows between buildings, so most buildings were extracted as one continuous region without interruption. In our proposed model, however, most buildings are extracted separately, no matter how close they are to each other. 2) Newly built large buildings usually have surfaces with different materials, heights, and reflections, which are challenging to extract fully and seamlessly. Examples of large-scale construction changes are given in the first and second columns of Fig. 10 and in the first column of Fig. 11, where our method extracted these buildings properly. While some methods, such as AGCDetNet, DDCNN, and EGCTNet, produced fewer errors than the rest, they still could not fully detect the changes. Our proposed method has fewer FP and FN errors than the other methods and was able to detect all changes in a large building more accurately; as the figures show, the extracted edges match reality well and contain far fewer errors.
3) Spectral similarities between building roofs and other features, such as roads, often cause false detection of changes. For example, the EGRCNN method, which has results comparable to our proposed method, identifies roads close to buildings as changes. In contrast, our proposed method is able to distinguish buildings from roads or objects of the same color as buildings. Some methods also identified shadows or small objects, such as containers or cars, as changes, while in our proposed method this type of error is observed much less often. 4) The last column of Fig. 10 shows an example of negative building changes. Although such changes due to complete demolition were scarce in our training data, the proposed method could still estimate them well. This is because the dual loss function includes WBCE, whose weights are set in proportion to the numbers of changed and unchanged training pixels.
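The color convention used in Figs. 10 and 11 (white TP, black TN, red FP, blue FN) can be reproduced with a short helper; the function name and the RGB encoding below are our own illustrative choices:

```python
import numpy as np

def error_map(pred, label):
    """Render a change-detection comparison image: white = TP, black = TN,
    red = FP, blue = FN."""
    pred = pred.astype(bool)
    label = label.astype(bool)
    h, w = pred.shape
    rgb = np.zeros((h, w, 3), dtype=np.uint8)     # TN stays black
    rgb[pred & label] = (255, 255, 255)           # true positive
    rgb[pred & ~label] = (255, 0, 0)              # false positive
    rgb[~pred & label] = (0, 0, 255)              # false negative
    return rgb
```

Rendering predictions this way makes boundary errors visible as thin red/blue fringes around building outlines, which is how the edge quality of the compared methods is judged visually.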

Efficiency analysis
To better appraise the proposed model against the state-of-the-art models, we performed an efficiency analysis on the LEVIR-CD dataset. For this purpose, the number of network parameters (in millions) and the training time (in minutes per epoch) were used as indices of model efficiency; the comparison is shown in Fig. 12. The method proposed in this article, based on the dual-branch network with parallel spatial-channel attention, has a relatively large number of parameters, which is its main limitation, but relatively low computational complexity. Our proposed method was close to methods such as EGRCNN and SNUNet in number of parameters, yet its training time per epoch was about 12 % less than that of EGRCNN. The training time of our method was only 2 % more than the SNUNet and STANet methods, while its F1 was 5.02 % higher than SNUNet and 5.43 % higher than STANet. For the EGCTNet, AGCDetNet, and DDCNN methods, the number of parameters was much higher due to their complexity, and the training time was therefore much longer. Although EGRCNN has about 2.6 million fewer parameters than our method, it needs more processing time because the network must learn both building changes and boundary estimation. Likewise, although STANet has fewer parameters than our method, it requires more processing time due to the pyramidal structure of its PAM. As discussed in the previous section, the results of the proposed model are very acceptable on both datasets, and its higher F1 compared to the other models indicates the proper performance of the proposed method.
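The two efficiency indices in Fig. 12 and the percentage comparisons quoted above can be computed with simple helpers; the function names and the example weight shapes are hypothetical, not taken from the paper:

```python
import numpy as np

def count_parameters(weight_shapes):
    """Total learnable parameters given the shapes of a model's weight
    tensors (a stand-in for iterating over a real framework's parameters)."""
    return int(sum(np.prod(s) for s in weight_shapes))

def relative_gain(ours, baseline):
    """Percentage by which `ours` exceeds `baseline` (negative = smaller).
    Used for comparisons such as 'training time 12 % less than EGRCNN'."""
    return 100.0 * (ours - baseline) / baseline
```

For example, a single 3 × 3 convolution with 64 input and 64 output channels contributes 3 · 3 · 64 · 64 weights plus 64 biases; summing such contributions over all layers and dividing by 10⁶ yields the "parameters (M)" axis of Fig. 12.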

Comparison study on different ECC loss operators
In Section 2.4, Equation 9, it was mentioned that L1 is used to minimize the distance between the calculated and predicted boundaries. Various functions can be used instead of L1 (Shi and Zhang, 2021): some studies have used MSE (Chen et al., 2022) to calculate edge loss, others L1, and the BCE function is also a common choice. We therefore made a comparison to show the effect of different loss operators for minimizing the distance between the extracted and predicted edges. To this end, three operators, L1, MSE, and BCE, were considered in the ECC loss function; that is, in Equation 9 we used MSE and BCE in addition to L1 and analyzed the results. Table 6 shows the results of this comparative experiment, presenting both the effect of the ECC loss function versus WBCE alone and the effect of the different ECC operators on building change detection accuracy. Based on the first row of Table 6, using the ECC function increased Pr, Re, and F1 compared to WBCE alone. The second to fourth rows of Table 6 correspond to using BCE, MSE, and L1, respectively, as the operator in the ECC loss function; L1 improved the accuracy more than the other two. Specifically, BCE, MSE, and L1 increased the F1 parameter by 1.24 %, 2.02 %, and 2.21 %, respectively.
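The three operators compared in Table 6 can be written down directly. This sketch (function and argument names are ours) applies each operator to a pair of boundary maps with values in [0, 1]:

```python
import numpy as np

def edge_loss(pred_edge, true_edge, op="l1", eps=1e-7):
    """Distance between predicted and reference boundary maps using one of
    the three operators compared in the text."""
    if op == "l1":
        # mean absolute difference
        return float(np.abs(pred_edge - true_edge).mean())
    if op == "mse":
        # mean squared difference
        return float(((pred_edge - true_edge) ** 2).mean())
    if op == "bce":
        # binary cross-entropy between edge probabilities
        p = np.clip(pred_edge, eps, 1 - eps)
        return float(-(true_edge * np.log(p)
                       + (1 - true_edge) * np.log(1 - p)).mean())
    raise ValueError(f"unknown operator: {op}")
```

Note that for residuals in [0, 1], MSE squares each difference and so penalizes small boundary misalignments less than L1 does, which is one plausible reading of why L1 yields sharper edges in Table 6.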
Next, to show how the different loss functions affect the network's learning process, the loss curves during training of the proposed network for the four modes are depicted in Fig. 13. In the first mode, which uses the WBCE loss without ECC, the convergence speed is significantly lower than when the ECC-WBCE loss is used, and at the 100th epoch the final loss value is higher than in the other three modes. This shows that the ECC loss increases the speed of network convergence. For the second, third, and fourth modes, where the ECC-WBCE loss was implemented with the three different operators, the convergence speed of BCE is lower than that of the MSE and L1 losses, with the largest gap around epoch 20, and the convergence speed of L1 is slightly higher than that of MSE. The BCE and MSE curves are almost equal after the 50th epoch, and the L1 curve approaches the other two after the 70th epoch. Therefore, as shown in Table 6 and Fig. 13, because L1 performs better than the other two operators, we used L1 in the final ECC-WBCE loss function for the proposed network.
Finally, the visual results of the different loss operators in the ECC-WBCE function are shown in Fig. 14. The sampled images in Fig. 14 show that all three operators provide promising results, although, as presented in Table 6, the BCE operator gives slightly weaker results. The results of the MSE and L1 operators are very close, but the L1 operator extracts the changes more completely and obtains better results on the edges.

Conclusion
This article proposed a dual-branch network with a parallel spatial-channel attention mechanism for building change detection in high-resolution images. The spatial-channel mechanism led to the extraction of appropriate discriminative features that identify changes in more detail, and their parallel fusion performed better than the concatenation fusion method. The proposed ECC-WBCE dual loss function balanced the effect of the imbalance between changed and unchanged areas on the network and improved the extraction of edges in changed buildings. The effectiveness of the dual-branch network, the spatial-channel attention in the two fusion modes, and the proposed loss function was demonstrated step by step. The results revealed that our method increased the accuracy evaluation metrics on both the LEVIR-CD and BCDD datasets. Finally, state-of-the-art methods were adopted to evaluate the proposed method, both visually and through quantitative metrics. The results showed that the dual-branch network with the parallel spatial-channel attention block and the ECC-WBCE loss function extracted both small and large changes well and had the best consistency at the edges. In future research, we will attempt to increase the model's generalization by using various datasets, especially in crowded and dense construction areas.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1. The overall framework of the proposed building change detection method based on a dual-branch network and parallel spatial-channel attention mechanism.

Fig. 2 .
Fig. 2. Overall architecture of the SAM. F is the input feature; K, Q, and V are the key, query, and value parameters, respectively; Fs is the spatial attention module; and Fsa is the result of applying spatial attention to the input F.

Fig. 3 .
Fig. 3. Overall architecture of the CAM. F is the input feature, Fx is the channel attention module, and Fca is the result of applying channel attention to the input F.

Fig. 4 .
Fig. 4. Overall architecture of the proposed parallel spatial-channel attention block. X is the input feature; Fsa and Fca are, respectively, the outputs of the spatial and channel attention mechanisms; the skip connection performs residual learning; and R is the final output of the parallel spatial-channel attention fusion.

Fig. 8 .
Fig. 8. Examples of change detection from our experiments on the BCDD validation set. (a) unchanged image, (b) changed image, (c) label, (d and e) results of U-Net (ResNet50) with binary cross-entropy and dual loss, (f and g) results of U-Net (ResNet50) with the parallel spatial-channel attention mechanism, with binary cross-entropy and ECC-WBCE loss, (h and i) results of the proposed dual-branch network with binary cross-entropy and ECC-WBCE loss, and (j and k) results of the proposed dual-branch network with the parallel spatial-channel attention mechanism, with binary cross-entropy and ECC-WBCE loss. White indicates the changed area and black the unchanged area.

Fig. 9 .
Fig. 9. Examples of change detection from our experiments on the LEVIR-CD validation set. (a) unchanged image, (b) changed image, (c) label, (d and e) results of U-Net (ResNet50) with binary cross-entropy and dual loss, (f and g) results of U-Net (ResNet50) with the parallel spatial-channel attention mechanism, with binary cross-entropy and ECC-WBCE loss, (h and i) results of the proposed dual-branch network with binary cross-entropy and ECC-WBCE loss, and (j and k) results of the proposed dual-branch network with the parallel spatial-channel attention mechanism, with binary cross-entropy and ECC-WBCE loss. White indicates the changed area and black the unchanged area.

Hao Chen and Zhenwei Shi introduced LEVIR-CD as an open-source building change detection dataset. It consists of 637 VHR image pairs (0.5 m/pixel) with patch sizes of 1024 × 1024 pixels, collected from Google Earth imagery. The images were taken from 2002 to 2018 in different cities in Texas, USA. Fig. 6 displays the geospatial distribution of the LEVIR-CD dataset. The images come from different seasons of the year, and brightness changes between image pairs are clearly visible; the dataset is therefore comprehensive for modeling real changes while reducing the effect of unreal changes caused by factors such as seasonal variation. Positive building

Fig. 11 .
Fig. 11.Visual comparison between state-of-the-art methods and the proposed method on the LEVIR-CD dataset.
The use of attention in a proper way has led to the extraction of long-range discriminative and global features. High Pr and Re have also increased F1, which has led to refined edges and complete segmentation of the building changes. The proposed model raised the accuracy evaluation parameters well while simultaneously focusing on boundary accuracy and region integrity. To better illustrate the proposed method's performance, visual results on the BCDD and LEVIR-CD datasets compared with the other state-of-the-art methods are shown in Fig. 10 and Fig. 11. Changes that were correctly detected (TP) are shown in white, and unchanged areas that were correctly detected (TN) are shown in black; red indicates pixels incorrectly detected as changed (FP), and blue indicates pixels incorrectly detected as unchanged (FN). In general, the results in Figs. 10 and 11 indicate the following points.

Fig. 12 .
Fig. 12. Demonstration of an efficiency analysis of the state-of-the-art methods.

Fig. 13 .
Fig. 13.The effect of different loss functions on the training of the proposed network and its convergence.I. is when the network only uses the WBCE loss function.II, III, and IV are for using ECC-WBCE loss function with BCE, MSE, and L1 operators, respectively.

Fig. 14 .
Fig. 14.Visual representation of the proposed method with ECC-WBCE loss function and BCE, MSE, and L1 operators on the LEVIR-CD dataset.

Table 1
Ablation study of parallel spatial-channel attention mechanism and ECC-WBCE loss on BCDD validation set.

Table 2
Ablation study of the spatial-channel attention mechanism and the dual loss function based on ECC-WBCE loss on the LEVIR-CD validation set.

Table 3
Hyperparameters used to run the state-of-the-art methods.

changes, that is, building growth due to changes in land cover (such as soil, plants, or buildings under construction turning into new buildings, or the removal of a building), are included in this dataset. The labeled LEVIR-CD contains 31,333 individual building changes, with an average of 50 changed buildings per 1024 × 1024 image. Most of the changes are newly constructed buildings, with approximately 987 pixels per image. Several images from this dataset are illustrated in Fig. 7.

Table 6
Results of the effectiveness of using ECC and its different operators on the accuracy of building change detection on the LEVIR-CD dataset.