A Siamese Network Combining Multiscale Joint Supervision and Improved Consistency Regularization for Weakly Supervised Building Change Detection

Building change detection (BCD) from remote sensing images is essential in various practical applications. Recently, inspired by the achievements of deep learning in semantic segmentation (SS), methods that treat the BCD problem as a binary SS task using deep siamese networks have attracted increasing attention. However, like their counterparts, these approaches still face the challenge of collecting massive pixel-level annotations. To address this issue, this article presents a novel weakly supervised method for BCD from remote sensing images using image-level labels. The proposed method elaborately designs a siamese network that integrates a multiscale joint supervision (MJS) module and an improved consistency regularization (ICR) module into a unified framework to improve the so-called class activation maps (CAMs), which are vital for producing high-quality pseudomasks from image-level annotations to support pixel-level BCD. To be specific, the MJS is used to generate refined multiscale CAMs that capture changes at different scales corresponding to buildings of varying sizes. The ICR contributes to improving the consistency of CAMs to highlight the boundaries of changed buildings. Extensive experiments on two public BCD datasets demonstrate that the proposed method outperforms current state-of-the-art approaches. Furthermore, the visual detection maps also indicate that the proposed method achieves scale-adaptive change detection results and preserves object boundaries more effectively.

Weakly supervised semantic segmentation (WSSS), which can achieve SS results via nonpixel-level annotation labels, provides an opportunity for BCD. As far as we know, WSSS includes different forms, such as image-level labels [29], [30], [31], [32], [33], [34], [35], bounding box-level labels [36], [37], and scribble-level labels [38]. Among them, the image-level label [as shown in Fig. 1(d)], which only indicates the existence of object classes in an image but provides no information about their locations, has fully fledged and widespread applications in building extraction, landslide recognition, and vegetation health monitoring [29], [30]. In these applications, the class activation map (CAM) plays a vital role: a high-quality CAM yields more precise pseudolabels and thus better detection performance.
Therefore, our research motivation is twofold: on the one hand, at the application level, there is an urgent need for a weakly supervised BCD method to address the burden of constructing pixel-level samples; on the other hand, at the technical level, CAM-based WSSS has inspired us and provided an opportunity for BCD. However, directly introducing WSSS into BCD faces the following challenges.
1) First, images from different periods are affected by factors such as seasonal variation, illumination conditions, satellite sensors, and solar altitude angles [39], [40]. These factors inevitably cause severe feature heterogeneity within the unchanged class and feature homogeneity within the changed class between the two periods of images, which has a negative impact on generating high-quality CAMs. 2) Second, compared with other land cover objects, the size variability of changed buildings is more severe, and additional scale variance arises from the partial expansion and reduction of buildings, as well as their construction and demolition [as shown in Fig. 2(a)-(c)]. 3) Finally, the change in buildings is universally accompanied by changes in surrounding land cover objects, such as roads, vegetation, and bare land, so blurry boundaries for changed buildings are unavoidable [as shown in Fig. 2]. What is more, the CAM struggles to cover the integral and accurate regions of changed buildings [as shown in Fig. 2(g)-(i)], which further aggravates this phenomenon. There is no doubt that these issues directly affect the quality of CAMs and indirectly affect the performance of BCD. Hence, motivated by the ideas of self-supervision and multiscale operation [29], [30], we propose a siamese weakly supervised network combining a multiscale joint supervision (MJS) module and an improved consistency regularization (ICR) module to improve CAM quality and realize image-level WSSS-based BCD. Concretely, to cope with the environmental variation between two-period images, we take the siamese network as the basic framework to extract multiscale feature maps from both periods simultaneously. Further, to handle the severely size-variant changed buildings, the MJS module directly supervises the multiscale changed feature maps to generate refined multiscale CAMs.
Finally, to ameliorate the blurry-boundary issue caused by the image-level weakly supervised mechanism and the symbiotic changes of nonbuilding objects, the ICR module is used to enhance the consistency of multiscale CAMs between the original and affine-transformed image pairs, thereby improving the boundary quality of changed buildings. To sum up, the main contributions of this article are as follows.
1) A siamese network framework using image-level WSSS for BCD is proposed for the first time, and the framework can adopt various image-level WSSS methods via simple modification to achieve BCD. 2) Aiming at the severely size-variant changed buildings, the MJS module is embedded into the weakly supervised BCD network to generate refined multiscale CAMs. 3) Aiming at the blurry boundaries caused by the image-level weakly supervised mechanism and the symbiotic change of nonbuildings, the ICR module uses an equivariant constraint mechanism to further enhance the consistency of multiscale CAMs and improve boundary quality. The rest of this article is organized as follows. Section II introduces the related work. Section III presents the proposed method in detail. Section IV describes and analyzes the experimental settings and results. Section V contains some discussions. Finally, Section VI concludes the article.
II. RELATED WORK

Until now, image-level WSSS-based BCD research has not been carried out, but it can be regarded as a combination of BCD and WSSS. Therefore, this section briefly reviews the two fields respectively and then summarizes WSSS-based BCD.

A. Building Change Detection
BCD methods can be divided into artificial feature-based and deep learning-based methods. Artificial feature-based methods mainly define spectral, morphological-index, and other features to recognize changed buildings at the pixel or object level. For example, Huang et al. [6] proposed a pixel-level BCD method using the morphological building index together with spectral, shape, and spatial information. Afterward, as image spatial resolution improved and image details became progressively richer, object-based BCD, which can effectively alleviate the salt-and-pepper phenomenon of pixel-based methods, was gradually popularized. For instance, Xiao et al. [8] proposed a concurrent segmentation and detection pattern as a new solution to object-based BCD. Although artificial feature-based methods have made good progress, they inevitably suffer from defects in the accuracy and completeness of changed buildings due to the weak robustness of artificial features and the uncertainty of the segmentation scale.
Deep learning can automatically extract robust semantic features at multiple layers and has become an effective tool for BCD. For example, Liu et al. [20] proposed a dual-task constrained deep siamese convolutional network, which contains change detection and SS subnetworks, to learn more discriminative features for BCD. Furthermore, considering the multiscale characteristic of changed buildings, Xue et al. [41] used a multibranch network to fuse multiscale features of changed buildings at different levels for BCD. Gao et al. [42] drew on the ability of U-Net to fuse feature maps of different scales and proposed an object-level refinement network based on transfer learning for built-up area change detection. In addition, with the wide application of the attention mechanism [43], [44], [45], [46], [47], Jiang et al. [43] proposed a pyramid feature-based attention-guided siamese network to enhance the representation of multiscale information for BCD.

B. Weakly Supervised Semantic Segmentation Based on Image-Level Label
WSSS originates from natural image recognition and includes different forms, such as image-level labels, bounding box-level labels, and scribble-level labels [48], [49], [50], [51], [52], [53]. As mentioned earlier, we focus on image-level label WSSS due to its fully fledged and widespread application [29], [30], [31], [32], [33], [34], [35]. For example, Kolesnikov et al. [51] proposed the SEC method, which contains three principles (seed, expand, and constrain) to refine the location information of objects in images. Ahn and Kwak [49] proposed AffinityNet, which infers the similarity between adjacent pixels to generate a transition matrix and adjust the activation coverage of CAMs. Inspired by self-supervision, which can narrow the supervision gap between fully and weakly supervised learning, some scholars have integrated it into WSSS [54], [55], [56]. For instance, Fan et al. [54] presented an affinity self-supervision module that models the relationship among a group of images to learn activation maps from two different images containing the same class of objects under the guidance of saliency maps. Furthermore, Wang et al. [29] proposed a self-supervised equivariant attention mechanism that produces CAMs with fewer overactivated and underactivated regions by constraining the outputs between the original image and an affine-transformed image, and proved that a self-supervision mechanism using consistency regularization can strengthen CAM quality at object boundaries.
Thanks to its progressive development on natural images [48], [49], [50], [51], [52], [53], [57], [58], [59], [60], WSSS has gradually been introduced into VHR remote sensing image classification, such as building detection, landslide extraction, and infected-tree recognition [61], [62], [63], [64], [65], [66], [67]. For example, Qiao et al. [65] used a simple weakly supervised deep-learning method for individual red-attack tree detection. Li et al. [64] used a general CNN to produce CAMs and pseudomask labels, and then an SS network considering CRF loss and classification loss to improve building extraction performance. Besides, Fang et al. [63] utilized adversarial climbing and a gated convolution strategy to generate class boundary maps and further refined building pseudomasks by fusing pairwise semantic affinities and CAMs via random walk. Furthermore, Yan et al. [30] proposed a WSSS method combining multiscale generation and superpixel refinement to improve CAM quality for building detection. Even though the superpixel segmentation applied in the postprocessing stage improves CAM quality at the boundaries, its performance is unstable due to the scale uncertainty of superpixel segmentation.

C. Difficulties and Solutions of WSSS-Based BCD
The review of BCD shows that most methods address scale variance through multiscale feature stacking or fusion, but this certainly leads to an aliasing effect and makes it difficult to extract effective and concise multiscale information, especially for changed buildings with partial expansion and reduction, as well as construction and demolition. At the same time, the review of WSSS shows that although several self-supervision mechanisms have been proposed, no self-supervision method has yet exploited the implicit equivariant constraint information between multiscale feature maps to further reinforce the boundary quality of changed buildings in CAMs. Thus, aiming at the above shortcomings regarding size variability of changed buildings and blurry boundaries in CAMs, we propose a siamese weakly supervised network combining an MJS module and an ICR module to improve CAM quality and realize image-level WSSS-based BCD. Specifically, to suppress feature heterogeneity of the unchanged class and feature homogeneity of the changed class between the two periods of images, a siamese network is adopted to improve the generalization ability of the model; to capture refined multiscale CAMs for severely size-variant changed buildings, a multiscale joint supervision module that directly supervises, rather than fuses or stacks, multiscale feature maps is introduced into the siamese weakly supervised network. Simultaneously, to further improve CAM quality at the boundaries of changed buildings, an ICR module based on the self-supervision mechanism enforces equivariant constraints on the multiscale feature maps of the original image pairs, as well as on the feature maps at corresponding scales between the original and affine-transformed image pairs.
The combination of the two modules is effective for WSSS-based BCD: it not only addresses the severe size variance of changed buildings but also overcomes the shortcoming of blurry boundaries.

III. PROPOSED METHOD
In this section, the proposed image-level WSSS-based BCD method is introduced in detail. It consists of a basic backbone network, an MJS module, an ICR module, and a loss function definition and CAM generation module. The overall framework is shown in Fig. 3.
Among them, the basic backbone network is a siamese subnetwork that extracts the basic multiscale feature map pairs of the two-period original images and the two-period affine-transformed images, respectively. The MJS module generates the multiscale changed maps from the basic multiscale feature map pairs. The ICR module further improves CAM quality at the boundaries of changed buildings via equivariant constraints on the multiscale changed feature maps of the original images, as well as between the changed feature maps of the original and affine-transformed image pairs. The loss function definition and CAM generation module describes the formation of the loss function and the manner of CAM generation.

A. Basic Backbone Network
Specifically, the backbone is a siamese ResNet-50 [68]. As shown in Fig. 3(a), it outputs four feature maps of different resolutions containing diverse hierarchical feature information. Note that we slightly modify the ResNet-50 backbone: the stride of the max-pooling layer in the original ResNet-50 is changed from 2 to 1, which retains a higher resolution for all feature maps.
For a pair of images I1 and I2 of size C × H × W, where C, H, and W denote the number of channels, height, and width of the original images, respectively, the basic feature maps have size H/2 × W/2 at the first residual unit, H/4 × W/4 at the second residual unit, and H/8 × W/8 at both the third and final residual units. Note that we only use the outputs of the second, third, and last residual units as the basic multiscale feature maps, because the output of the first residual unit contains too much noise response, which may affect the continuity of the subsequent CAMs. In the same way, a pair of affine-transformed images is fed through the shared backbone to obtain its basic multiscale feature maps.

B. Multiscale Joint Supervision
MJS not only avoids the aliasing effects and tedious operations caused by feature fusion and stacking but can also directly supervise feature maps to generate high-quality multiscale CAMs of changed buildings. In detail, the mechanism of MJS is as follows. First, the two periods of VHR images are input into our siamese backbone subnetwork to generate the basic multiscale feature maps. Then, the basic changed feature maps are acquired by taking the absolute difference of the corresponding channels of the basic multiscale feature maps. Finally, the multiscale changed maps are produced by the respective MJS units. Specifically, an MJS unit is composed of two convolution layers followed by a batch normalization layer, a rectified linear unit layer, and a dropout layer, and the kernel size of both convolutions is 1 × 1, as illustrated in Fig. 3(b). Since the number of channels differs among the basic multiscale feature maps, the number of input channels of each first convolution layer depends on the corresponding stage.
We denote the two periods of input VHR images by I1 and I2. After the siamese backbone procedure, the basic multiscale feature maps of the two periods are obtained, denoted $F_1^n$ and $F_2^n$, where n indicates the output of the nth stage and the subscripts 1 and 2 denote the two periods. Next, the basic multiscale changed feature maps are acquired by taking the absolute difference of the corresponding channels of the basic multiscale feature maps. Finally, based on the basic multiscale changed feature maps, the multiscale changed maps $DF^n$ are produced by the two convolution layers, batch normalization layer, rectified linear unit layer, and dropout layer of each MJS unit. The overall calculation of the multiscale changed maps is as follows:

$$DF^{n}(x, y) = \mathrm{MJS}\big(\big|F_1^{n}(x, y) - F_2^{n}(x, y)\big|\big)$$

where (x, y) denotes the pixel location of the feature maps, MJS(·) denotes the operation of the MJS unit, and $DF^{n}(x, y)$ denotes the multiscale changed map of the nth stage at pixel (x, y).
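A minimal PyTorch sketch of one MJS unit follows. The class name `MJSUnit`, the single-channel output, and the exact layer ordering are our assumptions: the paper specifies two 1 × 1 convolutions with batch normalization, ReLU, and dropout but not their precise arrangement.

```python
# Hedged sketch of one MJS unit: absolute channel-wise difference of the
# two-period feature maps, followed by 1x1 conv -> BN -> ReLU -> dropout -> 1x1 conv.
import torch
import torch.nn as nn

class MJSUnit(nn.Module):
    def __init__(self, in_channels, out_channels=1, dropout=0.5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Dropout2d(dropout),
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
        )

    def forward(self, f1, f2):
        diff = torch.abs(f1 - f2)  # basic changed feature map |F1^n - F2^n|
        return self.head(diff)     # multiscale changed map DF^n
```

Since the channel count differs per stage, one unit would be instantiated per retained residual unit with the matching `in_channels`.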

C. Improved Consistency Regularization
The proposed ICR module aims to further enhance CAM quality at the boundaries of changed buildings by incorporating the idea of self-supervision, as shown in Fig. 3(c). As proved in [29], consistency regularization can provide an implicit equivariant constraint between the original and affine-transformed images, yielding fewer overactivated and underactivated regions in CAMs. However, the consistency regularization in [29] only enforces the implicit equivariant constraint on the last feature map of the network between the original and affine-transformed images; it does not exploit the implicit equivariant constraints among the multiscale feature maps of the original images, nor those between the same-layer feature maps of the original and affine-transformed images, both of which also contain abundant implicit equivariant constraint information for further enhancing CAM quality at the boundaries of changed buildings.
Therefore, considering the above shortcomings, the ICR module will further seek more abundant and effective implicit equivariant constraint information from two levels. On the one hand, we improve the consistency regularization from a single changed map to multiscale changed maps in the original image pairs. On the other hand, we perform consistency regularization in the changed maps at the corresponding scale between the original image pairs and affine transformation image pairs.
Concretely, we denote the outputs of MJS for the original image pair by $DF^n$ and those for the affine-transformed image pair by $ADF^n$. The consistency regularization of the changed feature maps at the corresponding scale between the original and affine-transformed image pairs can then be described as

$$CRI^{n} = \big\| A\big(DF^{n}\big) - ADF^{n} \big\|_{1}$$

where A(·) denotes the affine transformation operation, $ADF^{n}$ denotes the nth-layer changed feature maps of the affine-transformed image pair, $\|\cdot\|_{1}$ denotes the L1 norm, and $CRI^{n}$ denotes the equivariant similarity of the changed feature maps between the original and affine-transformed image pairs at the nth layer. It should be noted that the affine-transformed images are obtained by downsampling with a factor of 0.5. Moreover, the transformation operation itself does not directly strengthen the network; rather, it only implicitly transforms the spatial scale of the images. It is the ICR module that plays the crucial role in enhancing network performance: it employs a self-supervision mechanism to enforce equivariant constraints among the multiscale changed maps of the original image pairs, as well as between the corresponding-scale changed maps of the original and affine-transformed image pairs.
The consistency regularization of the multiscale changed feature maps within the original image pair can be described as

$$CRF = \sum_{i \ne j} \big\| RS\big(DF^{i}\big) - DF^{j} \big\|_{1}$$

where CRF denotes the equivariant similarity of the multiscale changed feature maps of the original image pair, RS(·) denotes resampling a changed map from the scale of $DF^{i}$ to that of $DF^{j}$, and i and j index the changed maps of different scales. Note that all changed feature maps must be normalized before the consistency regularization operations.
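The two consistency terms above can be sketched as follows, with A(·) taken as 0.5× bilinear downsampling (as stated above) and min-max normalization assumed as the normalization step; the function names are ours.

```python
# Hedged sketch of the two ICR consistency terms (CRI^n and CRF).
import torch
import torch.nn.functional as F

def normalize_map(m, eps=1e-8):
    # Min-max normalize each map before the consistency comparison (assumption).
    m = m - m.amin(dim=(-2, -1), keepdim=True)
    return m / (m.amax(dim=(-2, -1), keepdim=True) + eps)

def cri_loss(df, adf):
    # CRI^n: L1 consistency between A(DF^n) and ADF^n at the corresponding scale.
    a_df = F.interpolate(df, size=adf.shape[-2:], mode='bilinear', align_corners=False)
    return (normalize_map(a_df) - normalize_map(adf)).abs().mean()

def crf_loss(dfs):
    # CRF: pairwise L1 consistency among the multiscale changed maps DF^i, DF^j.
    loss = 0.0
    for i in range(len(dfs)):
        for j in range(len(dfs)):
            if i == j:
                continue
            rs = F.interpolate(dfs[i], size=dfs[j].shape[-2:], mode='bilinear',
                               align_corners=False)  # RS(DF^i) -> scale of DF^j
            loss = loss + (normalize_map(rs) - normalize_map(dfs[j])).abs().mean()
    return loss
```

By construction, `crf_loss` vanishes when all multiscale maps agree after resampling, which is exactly the equivariance the ICR module rewards.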

1) Loss Function:
The loss function is composed of two parts: the MJS classification loss and the ICR loss. For the MJS classification loss, the predicted labels are obtained via a global average pooling operation on the multiscale changed feature maps, and the multiscale joint classification loss is then computed from the discrepancies between the predicted labels at different scales and the true label. The MJS classification loss can be described as

$$L_{\mathrm{MJS}} = \sum_{n} \ell_{\mathrm{cls}}\Bigg(\frac{1}{h \times w} \sum_{x=1}^{h} \sum_{y=1}^{w} DF^{n}(x, y),\; \hat{y}\Bigg)$$

where h and w denote the height and width of the changed feature map $DF^{n}$, ŷ denotes the true image-level label of building change, and $\ell_{\mathrm{cls}}$ denotes the classification loss.
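The MJS classification loss can be sketched as follows, assuming single-channel changed maps and binary cross-entropy as the classification loss (the paper does not name the exact loss function).

```python
# Sketch of the MJS classification loss: global average pooling turns each
# changed map DF^n into an image-level change score, compared with the
# image-level label; the per-scale losses are summed (BCE assumed).
import torch
import torch.nn.functional as F

def mjs_classification_loss(changed_maps, y_true):
    # changed_maps: list of DF^n tensors, each of shape (B, 1, h_n, w_n)
    # y_true: (B,) image-level labels, 1 = changed, 0 = unchanged
    loss = 0.0
    for df in changed_maps:
        logits = df.mean(dim=(2, 3)).squeeze(1)  # global average pooling
        loss = loss + F.binary_cross_entropy_with_logits(logits, y_true.float())
    return loss
```

Because every scale is supervised directly, gradients reach each MJS unit without passing through a fusion layer, which is the point of the joint supervision design.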
The ICR loss includes two consistency regularization terms, described below. Note that, according to the equivariant constraint operation, the equivariant similarities between changed maps are measured by the L1 norm: the smaller the L1 value, the closer the spatial and probability distributions of the changed maps, and hence the stronger the consistency.
The consistency regularization loss of the changed maps at the corresponding scale between the original and affine-transformed image pairs is

$$L_{\mathrm{CRI}} = \sum_{n} CRI^{n}.$$

The consistency regularization loss of the multiscale changed maps within the original image pair is

$$L_{\mathrm{CRF}} = CRF.$$

Finally, the overall loss function is

$$L = L_{\mathrm{MJS}} + L_{\mathrm{CRI}} + L_{\mathrm{CRF}}.$$

2) CAM Generation: After network training, we generate the CAMs by combining the backpropagated gradients with specific feature maps. Specifically, we use the Grad-CAM++ technique [59] to generate the multiscale changed-building CAMs:

$$CAM^{c}_{x,y} = \operatorname{ReLU}\Bigg(\sum_{k} w^{c}_{k}\, F^{k}_{x,y}\Bigg)$$

where $CAM^{c}_{x,y}$ denotes the class activation map of changed buildings at pixel (x, y), and $F^{k}_{x,y}$ denotes the kth target feature map at pixel (x, y). In our article, the target feature maps are the outputs of the second convolution layer in each MJS unit. $w^{c}_{k}$ denotes the contribution score of the kth target feature map to the changed-building class, i.e., the backpropagated gradient weight of the kth target feature map in the target convolution layer, and is calculated as

$$w^{c}_{k} = \sum_{x,y} a^{k,c}_{x,y} \cdot \operatorname{ReLU}\bigg(\frac{\partial Y^{c}}{\partial F^{k}_{x,y}}\bigg)$$

where $Y^{c}$ denotes the classification score of the changed-building class, and $a^{k,c}_{x,y}$ is calculated as

$$a^{k,c}_{x,y} = \frac{\dfrac{\partial^{2} Y^{c}}{\big(\partial F^{k}_{x,y}\big)^{2}}}{2\,\dfrac{\partial^{2} Y^{c}}{\big(\partial F^{k}_{x,y}\big)^{2}} + \displaystyle\sum_{a,b} F^{k}_{a,b}\,\dfrac{\partial^{3} Y^{c}}{\big(\partial F^{k}_{x,y}\big)^{3}}}.$$

Using the above Grad-CAM++ technique, the multiscale changed-building CAMs, ranging from low-level detailed information to high-level semantic features, can be obtained in sequence. Finally, using the fusion strategy proposed in [30], we merge the multiscale changed-building CAMs into the final CAM by $CAM_{\mathrm{final}} = \frac{1}{3}\sum_{n} CAM^{n}$, where n ∈ {2, 3, 4} indicates that we only use the CAMs generated from the second, third, and last residual units.
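The Grad-CAM++ weighting and the multiscale fusion can be sketched as follows, using the common substitution of powers of the first-order gradient for the higher-order derivatives (an implementation convention, not stated in the paper); the function names are ours.

```python
# Hedged sketch of Grad-CAM++ CAM generation and multiscale fusion.
import torch
import torch.nn.functional as F

def gradcam_pp(activations, score):
    # activations: (K, H, W) target feature maps F^k with requires_grad=True
    # score: scalar class score Y^c computed from the activations
    grads = torch.autograd.grad(score, activations, retain_graph=True)[0]
    g2, g3 = grads ** 2, grads ** 3
    # alpha^{k,c}_{x,y} with the usual grads^2 / grads^3 substitution
    denom = 2.0 * g2 + (activations * g3).sum(dim=(1, 2), keepdim=True)
    alpha = g2 / torch.where(denom.abs() > 1e-8, denom, torch.full_like(denom, 1e-8))
    weights = (alpha * F.relu(grads)).sum(dim=(1, 2))          # w^c_k
    cam = F.relu((weights[:, None, None] * activations).sum(dim=0))
    return cam / (cam.max() + 1e-8)                            # normalized CAM

def fuse_cams(cams, size):
    # CAM_final: average of the multiscale CAMs, upsampled to a common size.
    up = [F.interpolate(c[None, None], size=size, mode='bilinear',
                        align_corners=False)[0, 0] for c in cams]
    return torch.stack(up).mean(dim=0)
```

In the actual method, `gradcam_pp` would be applied to the second-convolution outputs of the three MJS units, and `fuse_cams` would average the three resulting maps, matching the 1/3 fusion rule above.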

A. Dataset Descriptions
The proposed method is evaluated on two public large-scale datasets that are popular in BCD research, i.e., the WHU dataset [69] and the LEVIR dataset [70]. Both consist of bitemporal images and exhibit some discrepancies in seasonal variation, illumination conditions, and satellite sensors.

The WHU dataset is an open BCD dataset covering Christchurch, New Zealand, from 2012 to 2016. It consists of one pair of aerial images of size 32 507 × 15 354 with 0.075 m spatial resolution, as shown in Fig. 4. The dataset has the following challenges.
1) It includes abundant multiscale changed buildings, such as partial expansion and reduction of buildings, and construction and demolition of buildings. 2) It consists of various shapes and types of changed buildings, such as circular tank buildings, rectangular warehouses, and irregular residences with different roof colors. 3) Moreover, irrelevant changed objects, such as parks, roads, and cars, heavily cover the images. The LEVIR dataset is an open BCD dataset collected from 20 different regions located in several cities in Texas, USA, from 2002 to 2018. It consists of 637 very high-resolution Google Earth image pairs of 1024 × 1024 pixels with 0.5 m resolution, as shown in Fig. 4. The dataset has the following challenges.
1) The boundaries of buildings are blurry due to the low spatial resolution of images.
2) The shadows between buildings are severe due to the large solar altitude angle and the dense distribution of buildings. 3) Moreover, the buildings are generally small, such as villa residences and apartments.

B. Evaluation Metrics and Implementation Details
To quantitatively evaluate the quality of the generated CAMs and the final detection results, we adopt four common metrics: Precision, Recall, F1-score (F1), and intersection over union (IoU).
For the experimental data, we clip each image pair into patches of 256 × 256 pixels. A patch is selected as a positive image-level sample if the ratio of changed-building pixels to all pixels in its label image exceeds 0.14, and as a negative image-level sample if the ratio of unchanged pixels equals 1 (i.e., it contains no changed pixels). In this way, we obtain 10 514 and 7959 image pairs of 256 × 256 pixels on the WHU and LEVIR datasets, respectively. Finally, on the WHU dataset, we select 980 positive samples and 6188 negative samples as the training set and 4368 samples as the test set (as shown in Fig. 5). Likewise, on the LEVIR dataset, we select 958 positive samples and 3953 negative samples as the training set and 2048 samples as the test set, as shown in Fig. 5.
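The sample-selection rule above can be sketched as follows; discarding patches that fall between the two thresholds is our assumption, since the text only defines the positive and negative cases.

```python
# Sketch of the image-level labeling rule: a clipped patch is positive if
# changed-building pixels exceed 14% of the patch, negative if it contains
# no changed pixels at all.
import numpy as np

def image_level_label(change_mask, pos_ratio=0.14):
    # change_mask: (H, W) binary array, 1 = changed-building pixel
    ratio = float(np.mean(change_mask > 0))
    if ratio > pos_ratio:
        return 1      # positive: image-level "changed" label
    if ratio == 0.0:
        return 0      # negative: entirely unchanged patch
    return None       # ambiguous patches are discarded (assumption)
```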
For the proposed network, we use ResNet-50 pretrained on the ImageNet dataset as the backbone and slightly modify it as described above. For the BCD network, we use DeepLabV3+ [71] pretrained on the ImageNet dataset as the backbone to achieve BCD SS. Specifically, as shown in Fig. 6, we first extract the basic feature maps and the low-level feature maps of the two-period images with the backbone, denoted F1, F2, low-F1, and low-F2, where 1 and 2 denote the two periods. Second, the basic changed feature map is obtained by taking the absolute difference of F1 and F2. Third, the basic changed feature map is input into the ASPP, which is the same as in the original DeepLabV3+ network. Next, the low-level changed feature map is obtained by taking the absolute difference of low-F1 and low-F2. Finally, the basic changed feature map and the low-level changed feature map are concatenated and input into the subsequent subnetworks, which are also the same as in the original DeepLabV3+ network. For optimization, we adopt SGD with momentum 0.9, weight decay 5e-4, and learning rate 0.0001. The network is trained for 100 epochs with a batch size of 4. Our experiments are performed on the PyTorch platform using an NVIDIA RTX 3080 GPU with 12 GB of memory.
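The siamese DeepLabV3+-style change head described above can be sketched as follows. `SiameseChangeHead` and its constructor arguments are our names; the backbone, ASPP, and decoder are passed in as opaque components rather than reimplemented, so this is a structural sketch, not the authors' exact network.

```python
# Minimal sketch of the siamese change-feature construction used with the
# DeepLabV3+ segmentation head (ASPP and decoder treated as black boxes).
import torch
import torch.nn as nn

class SiameseChangeHead(nn.Module):
    def __init__(self, backbone, aspp, decoder):
        super().__init__()
        self.backbone = backbone  # returns (high-level, low-level) features
        self.aspp = aspp          # same ASPP as the original DeepLabV3+
        self.decoder = decoder    # concatenates with low-level features and decodes

    def forward(self, img1, img2):
        f1, low1 = self.backbone(img1)      # period-1 features
        f2, low2 = self.backbone(img2)      # shared weights for period 2
        diff = torch.abs(f1 - f2)           # basic changed feature map
        low_diff = torch.abs(low1 - low2)   # low-level changed feature map
        return self.decoder(self.aspp(diff), low_diff)
```

The key design choice mirrored here is that the absolute difference is taken in feature space at two levels, and everything after that point is unchanged DeepLabV3+.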

C. Ablation Experiments
As described above, since the main contribution of our method lies in improving CAM quality, the ablation experiments only explore the effectiveness of the two modules in generating CAMs. First, the network with both the MJS and ICR modules removed serves as the baseline, named Baseline. Second, the MJS module is added to the baseline network, named Baseline+MJS, to explore the effect of the MJS module on CAM generation; note that the final CAM is generated in the same way as in our method. Third, the ICR module is embedded into the baseline network, named Baseline+ICR, to explore the effect of the ICR module on CAM generation; note that in this variant the second and third residual unit outputs are no longer supervised and only the last residual unit output is supervised, but the ICR is still applied to the multiscale changed feature maps of the original image pairs and to the changed feature maps at corresponding scales between the original and affine-transformed image pairs. Finally, the MJS and ICR modules are integrated into the baseline network together, i.e., our proposed method.

1) Evaluation for ICR Module in the WHU Dataset: As given in Table I, regarding the accuracy of the CAMs, the Baseline+ICR achieves a good improvement over the Baseline on the WHU dataset, with gains of about 4.4% and 4.9% in F1 and IoU, respectively. Although the Recall drops by 0.6%, the Precision increases significantly by about 7.1%, yielding a more balanced result between Recall and Precision. This shows that the ICR module plays a positive role in improving CAM quality.
In addition, some visual examples of CAMs are displayed in Fig. 7, which support the same conclusion as the quantitative analysis. Specifically, compared with the Baseline, the overactivated areas at the boundaries of changed buildings are reduced, and the highly activated regions inside the changed buildings become more continuous and concentrated. This indicates that the ICR module can improve the consistency of CAMs via the equivariant constraint to implicitly ameliorate the shortcoming of blurry boundaries. However, it cannot be ignored that the absolute F1 and IoU accuracies are still too low to be acceptable.
2) Evaluation for MJS Module in the WHU Dataset: As given in Table I, regarding the accuracy of the CAMs, the Baseline+MJS achieves a great improvement over the Baseline on the WHU dataset, with gains of about 14.1%, 9.1%, and 10.6% in Precision, F1, and IoU, respectively. This shows that the MJS module also plays a significant role in improving CAM quality.
Moreover, it can be found from Fig. 7 that, compared with the Baseline, almost all multiscale changed buildings are activated by Baseline+MJS, especially the large-scale changed buildings. This indicates that the MJS module can capture size-variant changed buildings. However, although the blurry boundaries of changed buildings in the CAMs are also alleviated, it cannot be ignored that some low- and middle-activated regions still occur at the boundaries of changed buildings.

3) Evaluation for the Combination of ICR and MJS Modules in the WHU Dataset: As given in Table I, regarding the accuracy of the CAMs on the WHU dataset, on the one hand, the Baseline+MJS outperforms the Baseline+ICR by about 4.7% and 5.7% in F1 and IoU, respectively, which indicates that the improvement from the MJS module is larger than that from the ICR module. On the other hand, the combination of the ICR and MJS modules achieves satisfactory CAM results, reaching 78.86% and 65.09% in F1 and IoU, respectively, which is also superior to both Baseline+ICR and Baseline+MJS and shows that the combination of the two modules is very effective.
From Fig. 7, it can be found that plentiful overactivated and underactivated regions in the CAMs are avoided by the combination of the ICR and MJS modules. Specifically, the shortcoming of blurry boundaries is further alleviated, and the integrity of changed buildings is further enhanced. Besides, changed buildings of different scales are detected, including large warehouses and small residences. In addition, although the performance at the boundaries of changed buildings is already improved in Baseline+MJS, it can be further enhanced with the help of the ICR module, which decreases the low- and middle-activated regions at the boundaries; this proves that the ICR module plays an implicit role in improving boundary quality.

4) Evaluation for ICR Module in the LEVIR Dataset: As given in Table II, the Baseline+ICR achieves a good improvement in CAM over the Baseline, with gains of about 4.6% and 5.2% in F1 and IoU, respectively, similar to the improvement on the WHU dataset. However, it should be noted that the improvement on the LEVIR dataset is reflected in the Recall, in contrast to the WHU dataset, where it is reflected in the Precision. The main reason for this discrepancy is that the changed buildings on the LEVIR dataset are too small to be detected by the Baseline, but with the help of the ICR module, the consistent information between high-resolution and low-resolution feature maps can be captured, which improves the recognition of small changed buildings.
In addition, it can be found from Fig. 8 that some overactivated areas in CAM have decreased, such as roads and trees. Besides, the activated areas are gradually concentrated inside the changed buildings, and some shadow areas between buildings also become low activated. Table II, it can be found that compared with the Baseline, the Baseline+MJS has achieved a great improvement in CAM, in which the raised accuracies reach about 8.4%, 8.4%, 12.5%, and 12.2% at Recall, Precision, F1, and IoU, respectively, which prove that the MJS modules are also effective to improve the quality of CAM.

5) Evaluation for MJS Module in the LEVIR Dataset: As given in
Moreover, it can be found from Fig. 8 that lots of overactivated areas have decreased greatly, high-activated areas almost no longer appear on the roads and trees, and all small changed buildings have been activated, all of which show that MJS module cannot only suppress the irrelevant overactivated area but also strengthen the underactivated changed buildings area. Table II, it can be found that the result for the LEVIR dataset shows a similar conclusion as to the WHU dataset. On the one hand, the Baseline+MJS outperforms the Baseline+ICR in CAM, and the improvement reaches about 6.9% and 7.0% at F1 and IoU, respectively; on the other hand, the combination of ICR and MJS modules has achieved very good results in CAM, and the accuracies reach 69.18%, 52.88% at F1 and IoU, and shows that the combination of ICR and MJS modules is also effective for the LEVIR dataset.

6) Evaluation for the Combination of ICR and MJS Modules in the LEVIR Dataset: As given in
Furthermore, we can find from Fig. 8 that plentiful overactivated and underactivated regions in CAMs have been avoided using the combination of ICR and MJS modules. Specifically, high-activated and low-activated areas no longer exist in roads and trees, and the interior of changed buildings is activated highly especially the small changed building. Also, the shortcoming of blurry boundaries has been further alleviated, and the individual changed buildings can be separated from the dense building groups, and few shadow areas are activated.
However, it cannot be ignored that the performance on the two datasets shows an unexpected gap: the performance on the LEVIR dataset is worse than on the WHU dataset, by almost 9.7% and 12.2% in F1 and IoU, respectively. There are two main reasons for this phenomenon: one is that the resolution of the images in the LEVIR dataset is too low, so the spatial information is inadequate [as shown in Fig. 9(a)]; the other is that the changed buildings are tightly distributed, which leads to severe shadows between them [as shown in Fig. 9(b)]. Both factors aggravate the difficulty of generating high-quality CAMs.
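The Recall, Precision, F1, and IoU figures quoted throughout these comparisons are standard pixel-level metrics for the changed class. As a minimal sketch (the function name is ours, not from the article), they can be computed directly from binary masks:

```python
import numpy as np

def changed_class_metrics(pred, gt):
    """Pixel-level Recall, Precision, F1, and IoU for the changed class.

    pred, gt: binary arrays where 1 marks changed-building pixels.
    """
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    eps = 1e-9  # guard against division by zero on empty masks
    recall = tp / (tp + fn + eps)
    precision = tp / (tp + fp + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return recall, precision, f1, iou
```

Note that F1 and IoU are monotonically related (IoU = F1 / (2 - F1)), which is why the two metrics track each other in Tables I-IV.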

D. Comparison Experiments
As mentioned in Section II, although some image-level-based WSSS methods have been proposed, many of them do not essentially improve the quality of CAMs but merely apply postprocessing operations to them. Therefore, to demonstrate the effectiveness of our method fairly, seven image-level-based WSSS methods that do essentially improve CAM quality, namely, L2G [33], WILDCAT [31], PANet [34], DRS [35], WSF [32], SEAM [29], and MSG-SR-Net [30], are selected as comparison methods; all of them have been applied to natural or VHR image recognition. The comparison experiments on CAM generation and BCD SS are performed on both the WHU and LEVIR datasets to demonstrate the superiority of our method. On the one hand, CAMs are generated by the above methods, and evaluation metrics for the CAMs are calculated. On the other hand, according to these CAMs and the division thresholds, pseudolabels are produced to train the BCD SS network, and evaluation metrics for the SS results are calculated.

Fig. 9. Detailed image examples from the LEVIR dataset. The T2 detail images show that the spatial resolution is low and the building boundaries are blurry, and that there are severe shadows between buildings.
In the CAM generation, two modifications need to be noted. On the one hand, some of the compared methods include postprocessing steps for improving CAM quality, but these steps are not applicable to BCD; we therefore drop them and compare the raw CAMs directly. On the other hand, the compared methods were designed for single-period images, which differs from the BCD setting. Therefore, to maintain uniformity, our BCD framework, i.e., the siamese network, is used in the compared methods to obtain changed feature maps for generating CAMs.
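The shared CAM generation step follows the classical formulation: the final feature maps are weighted by the classifier weights of the changed class and summed. A minimal sketch, where the use of siamese difference features as input and all names are our illustrative assumptions:

```python
import numpy as np

def class_activation_map(changed_features, fc_weights):
    """Classical CAM: classifier-weight-weighted sum of feature maps.

    changed_features: (C, H, W) feature maps, e.g., the difference of the
        two siamese branches' outputs (illustrative assumption).
    fc_weights: (C,) classifier weights for the 'changed' class.
    Returns a CAM normalized to [0, 1].
    """
    cam = np.tensordot(fc_weights, changed_features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0)          # keep only positive evidence (ReLU)
    return cam / (cam.max() + 1e-9)   # normalize to [0, 1]
```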
In the BCD SS experiments, the generation of pixel-level pseudolabels for changed and nonchanged buildings must be introduced first, because the quality of the pixel-level labels directly influences the detection performance for changed buildings. If only a single division threshold on the CAMs were used to generate the pseudolabels, severe label noise would result, because many pixels belong to a noncommittal class. It is therefore necessary to select two division thresholds, one for changed buildings and one for nonchanged buildings. However, directly traversing values between 0 and 1 for both thresholds and repeatedly training the BCD network to find the optimal pair would be prohibitively expensive. We therefore determine the two thresholds as follows. First, we select about ten image pairs from the image-level training samples and interpret their changed buildings at the pixel level. Second, pseudolabels are generated from the CAMs under different division thresholds. Next, F1 scores under the different thresholds are calculated against the pixel-level changed-building samples. Then, a single optimal division threshold is obtained from the best F1. Finally, the two division thresholds are selected in the interval around this single optimal threshold by repeatedly training and testing on those ten image pairs. Eventually, if a CAM value is greater than the changed-building threshold, the pixel is assigned to the changed-building pixel-level samples; if it is smaller than the unchanged-building threshold, the pixel is assigned to the unchanged-building pixel-level samples.
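The final two-threshold assignment rule can be sketched as follows; the threshold values and the ignore value of 255 are illustrative choices, and pixels between the two thresholds are left noncommittal and excluded from training:

```python
import numpy as np

def pseudolabel_from_cam(cam, t_unchanged, t_changed, ignore=255):
    """Two-threshold pseudolabel generation from a normalized CAM.

    cam: (H, W) activations in [0, 1].
    Pixels >= t_changed   -> 1 (changed building)
    Pixels <= t_unchanged -> 0 (unchanged)
    Pixels in between     -> ignore (noncommittal, excluded from training)
    """
    assert t_unchanged < t_changed
    label = np.full(cam.shape, ignore, dtype=np.uint8)
    label[cam >= t_changed] = 1
    label[cam <= t_unchanged] = 0
    return label
```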
According to the above generation method for pixel-level pseudolabels, we obtain the optimal pair of division thresholds for L2G [33], WILDCAT [31], PANet [34], DRS [35], WSF [32], SEAM [29], MSG-SR-Net [30], and our method on the two BCD datasets, respectively.
1) CAM Result of the WHU Dataset: From the quantitative perspective, Table III shows that our method achieves the best CAM performance on the WHU dataset, with F1 and IoU reaching 78.86% and 65.09%, respectively, a remarkable improvement over the other methods. In Recall, DRS obtains the best performance; in Precision, our method does. At the same time, our method achieves a more balanced result between Recall and Precision and also obtains the best F1 and IoU, with improvements of at least about 1.5% and 2.1% over the other methods, which demonstrates that our method is effective in CAM generation and outperforms the other methods on the WHU dataset. Moreover, the performance of L2G is poor, with an IoU of only 38.63%.
From the visual perspective, Fig. 10 shows that the results of L2G, WILDCAT, WSF, PANet, SEAM, and MSG-SR-Net contain many underactivated areas inside the changed buildings, as well as many overactivated areas in nonchanged regions, whereas our method yields the highest-quality CAMs in both changed-building and nonchanged-building areas. In detail, compared with the other methods, our method can activate multiscale changed buildings, including big warehouses and small residences, and can activate nearly intact changed buildings, such as the circular tank building and the rectangular building. We attribute these positive effects to two factors: on the one hand, the MJS extracts effective size-variant information of changed buildings to improve the multiscale response; on the other hand, the ICR further reinforces the boundary quality of the CAMs through the implicit equivariant constraint on the multiscale feature maps.
2) BCD Result of the WHU Dataset: From the quantitative perspective, Table III shows that our method achieves the best results in BCD SS: F1 and IoU reach 72.79% and 57.22%, respectively, on the WHU dataset, at least about 3% and 3.7% higher than the other methods. The Precision is also the highest among all methods, by at least about 5.6%. Although the Recall is not the best, with a margin of 6.3% compared with SEAM, our method gains the desired improvement in Precision and achieves a more balanced accuracy. Moreover, Table III shows that the accuracy discrepancies in BCD between the methods are larger than those in the CAMs, which demonstrates that CAM quality strongly affects BCD performance: higher-quality CAMs generate more accurate pseudolabels and further enhance BCD performance.
From the visual perspective, Fig. 11 shows that our method performs better in the integrity and boundaries of changed buildings. On the one hand, the changed buildings detected by our method have geometric shapes that correspond well to the labels; on the other hand, our method can detect multiscale changed buildings, from very small to very large, in the WHU dataset. In contrast, the other methods exhibit serious speckle noise inside the changed buildings and sawtooth artifacts along their boundaries, both of which derive from overactivated and underactivated CAMs. These advantages of our method benefit from the MJS and ICR modules: the former generates multiscale CAMs, and the latter further improves their boundary quality. Thanks to these advantages, the BCD segmentation performance of our method is satisfactory.
3) CAM Result of the LEVIR Dataset: From the quantitative perspective, Table IV shows that our method achieves the best CAM performance on the LEVIR dataset, with F1 and IoU reaching 69.18% and 52.88%, a remarkable improvement over the other methods. In Recall, WSF obtains the best performance; in Precision, our method does. In F1 and IoU, our method improves by at least 2.0% and 2.3% over the other methods. These results demonstrate that our method is also effective in CAM generation and outperforms the other methods on the LEVIR dataset. Moreover, the performance of L2G is poor, with an IoU of only 23.18%.
From the visual perspective, Fig. 12 shows that the results of L2G, WILDCAT, WSF, PANet, SEAM, and MSG-SR-Net contain many underactivated areas between changed buildings, caused by the serious shadows next to the buildings, whereas our method can separate and activate individual changed buildings within dense building groups. Moreover, in the other methods the high-activated regions are scattered around the changed buildings, while our method concentrates them inside the changed buildings. We attribute these positive effects mainly to the ICR, which further captures detailed information from the multiscale feature maps to improve the CAM quality along the boundaries between changed buildings and shadows.

4) BCD Result of the LEVIR Dataset:
From the quantitative perspective, Table IV shows that our method achieves the best results in BCD SS: F1 and IoU reach 67.41% and 50.84% on the LEVIR dataset, at least about 2.6% and 2.9% higher than the other methods. The Precision is also the highest among all methods, by at least about 8.2%. Although the Recall is not the best, with a margin of 14.2% compared with SEAM, our method gains the desired improvement in Precision and achieves a more balanced accuracy.
From the visual perspective, Fig. 13 shows that our method performs better in the integrity of changed buildings. Specifically, the other methods generally suffer from serious overactivated areas, especially between the changed buildings. There are two reasons for this phenomenon: on the one hand, their CAMs contain many overactivated areas, which make the pseudolabels inaccurate; on the other hand, the LEVIR dataset is challenging due to the low spatial resolution of the images and the severe shadows between changed buildings. In our method, by contrast, the superior CAMs yield more exact pseudolabels and hence a more accurate training process, so our method obtains the best BCD performance.

5) Computational Complexity:
To compare the computational complexity of the different methods, we select the parameter size, training time, and inference time of each model as evaluation metrics. The results are given in Table V, which reports the parameter size and time complexity of the different models. In the training phase, our model took longer due to the high computational complexity of the affine transformation operation. Additionally, in the inference phase, our model and MSG-SR-Net took the longest time due to the extra computation required to combine the multiscale changed maps. Furthermore, the parameter size and inference time of our model are the same as those of MSG-SR-Net, as the frameworks of the two models are similar.
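Parameter size and inference time can be measured in a framework-agnostic way; a minimal sketch with hypothetical names, using a plain callable in place of the network:

```python
import time
import numpy as np

def parameter_count(weight_arrays):
    """Total number of learnable parameters across all weight tensors."""
    return sum(w.size for w in weight_arrays)

def mean_inference_time(model_fn, sample, warmup=3, repeats=20):
    """Average wall-clock time per forward pass after a warm-up phase."""
    for _ in range(warmup):          # warm-up runs are excluded from timing
        model_fn(sample)
    start = time.perf_counter()
    for _ in range(repeats):
        model_fn(sample)
    return (time.perf_counter() - start) / repeats
```

In practice, timing on a GPU additionally requires synchronizing the device before reading the clock, which framework-specific tools handle.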

V. DISCUSSION
In this section, our discussion focuses on two aspects: first, why our method outperforms the other methods; and second, why our comparison results differ significantly from those reported for weakly supervised building extraction.
Regarding the first issue, we find that our method shows significant improvement over all methods except MSG-SR-Net. The CAMs show that our method produces fewer underactivated areas inside changed buildings and fewer overactivated areas in nonchanged regions, which indicates that the MJS module plays a crucial role for multiscale changed buildings. Moreover, our method shows a slight improvement over MSG-SR-Net: the CAMs capture good responses along the boundaries of changed buildings with varying shapes, which indicates that the ICR module can reinforce the boundary quality of the CAMs through the implicit equivariant constraint on the multiscale feature maps.
Regarding the second issue, Yan et al. [34] claimed for weakly supervised building extraction that PANet achieves the best performance and that SEAM performs worse than WSF, which differs considerably from our comparison results. After thorough analysis, we identify two main reasons. First, the size variability of changed buildings is more severe: additional scale variance arises from the partial expansion and reduction of buildings, as well as from their construction and demolition, which places greater demands on the multiscale ability of the model. Second, the images from different periods in BCD are susceptible to factors such as seasonal variation, illumination conditions, satellite sensors, and solar altitude angle, which cause severe feature heterogeneity of the unchanged class and feature homogeneity of the changed class between the two periods, placing greater demands on the robustness of the model. Therefore, methods for WSSS building extraction cannot simply be transplanted to WSSS BCD; the inherent characteristics of changed buildings must be fully considered and reasonable solutions designed.

VI. CONCLUSION
In this article, we proposed a siamese weakly supervised network that combines an ICR module and an MJS module to strengthen the quality of CAMs for image-level-based WSSS BCD. Comprehensive experiments were carried out on the WHU and LEVIR datasets. For CAM accuracy, F1 and IoU reach 78.86% and 65.09% on the WHU dataset, and 69.18% and 52.88% on the LEVIR dataset. For SS accuracy, F1 and IoU reach 72.79% and 57.22% on the WHU dataset, and 67.41% and 50.84% on the LEVIR dataset. These results demonstrate that the proposed method is effective in WSSS BCD and outperforms the current state-of-the-art WSSS methods. In particular, the MJS module and the ICR module not only address the serious size variation of changed buildings caused by the partial expansion and reduction of buildings, as well as their construction and demolition, but also overcome the blurry boundaries of changed buildings caused by the image-level weakly supervised mechanism and by nonbuilding symbiotic changes.
Although our proposed method achieves performance approaching that of strong supervision (pixel-level samples), some limitations remain. First, while the method can identify almost all changed buildings, the absolute accuracy is not high enough. Second, the boundaries of changed buildings are imperfect, mainly because the CAMs cannot fully cover the true boundaries, which may affect the accuracy of the pixel-level samples for BCD. Finally, the proposed method is less effective on small changed buildings, mainly because CAMs struggle to detect and respond to such small objects.
Consequently, drawing inspiration from building extraction techniques that optimize CAMs through a series of postprocessing methods, such as superpixel segmentation and CRF processing, our subsequent research aims to develop a postprocessing method suited to WSSS BCD to improve the quality of CAMs in terms of boundary and scale.

Fig. 11. BCD results of different methods on the WHU dataset. White represents correctly detected changed buildings, red represents false alarms, and green represents missing alarms.

Fig. 13. BCD results of different methods on the LEVIR dataset. White represents correctly detected changed buildings, red represents false alarms, and green represents missing alarms.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.