3M-CDNet-V2: An Efficient Medium-Weight Neural Network for Remote Sensing Image Change Detection

Remote sensing-based change detection (CD) is a critical technique for detecting land surface changes in earth observation. Inspired by the recent success of a lightweight CD network, 3M-CDNet, we implemented nine meaningful modifications to it, for example, the incorporation of MHSA (Multi-Head Self-Attention). This elaborately designed model is termed 3M-CDNet-V2. Its effectiveness and advantages were demonstrated on three engineering CD datasets, and the experimental results indicated that: (1) relative to other state-of-the-art algorithms, the proposed model obtained very competitive or slightly better performance in both visual comparison and quantitative metric evaluation. (2) By applying a novel transfer learning strategy, 3M-CDNet-V2 performed well on the small dataset. (3) The incorporation of MHSA brings a substantial accuracy improvement while only moderately increasing the computational complexity. (4) The late-fusion framework and deep supervision contribute most to the performance gain of 3M-CDNet-V2, while the introduction of low-level features to the classifier via skip connections guides the model to focus on detailed spatial information such as small changes, narrow shaped objects, and accurate boundaries. We hope that our 3M-CDNet-V2 model helps improve the understanding of network architecture design for CD.


I. INTRODUCTION
Change detection (CD) based on remote sensing (RS) technology is widely used to discover and identify differences in ground objects using two or more images of the same geographical location [1]. The development of this technology has attracted the interest of many researchers in the RS community, and it has been applied in many fields, such as illegal construction identification, resource surveys, disaster monitoring, urban planning, mine-environment investigation [2], and so on. Benefiting from the rapid development of high-spatial-resolution and multi-temporal RS earth observation, high-resolution multispectral images have gradually become the primary data source for many RS applications, especially change detection [1], [3]. However, although the fine contextual information and complex spatial characteristics that high-resolution images convey offer rich spatial details for land use/cover change analysis, they also face several serious challenges [4]. Difficulties arise from enhanced intra-class variability [5], spatial displacement due to the parallax distortion of ground objects (especially high-rise buildings) [6], the confused spectral features of different objects, complex scenarios, illumination, camera motion, shadows, misregistration errors, and so on.
Due to their great advantages in deep feature representation and non-linear problem modeling [5], deep learning technologies have opened up new opportunities for CD tasks to address the above-mentioned problems. For example, the CNN (Convolutional Neural Network) architecture itself achieves some degree of shift, scale, and distortion invariance [7]. Over the past years, enormous efforts have been made to apply deep learning-based methods to CD using high-resolution RS images; readers are referred to [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], and [20]. A wide variety of research has indicated the superiority of these methods over traditional machine learning or unsupervised methods, e.g., Support Vector Machine, Random Forest, Slow Feature Analysis [21], Robust Change Vector Analysis [22], and so on. This is because they can learn representative and discriminative features from a vast array of samples [1]. At present, Siamese CNNs and Transformers are the two mainstream deep learning approaches and research directions for CD. Beginning with the UNet-based models FC-EF, FC-Siam-conc, and FC-Siam-diff [10] and their revolutionary performance on the classical CDD and BCDD datasets [5], Siamese CNN architectures have evolved to become increasingly powerful through denser skip connections [5], larger scale [11], more sophisticated forms of convolution [14], semi-supervised architectures [12], [13], as well as hybrid loss functions [15]. At the same time, Transformer-based network architectures for CD have also received increased attention [23], [24], [25], and they have achieved competitive performance compared with their CNN counterparts.
By leveraging multi-head self-attention, the Transformer has larger receptive fields and is notable for its use of attention to model long-range dependencies in the data [25]. In contrast, most advanced CD pipelines using CNNs still struggle to relate long-range concepts in space-time because of their inherent limitation on the size of the receptive field [23], [26]. Even so, Transformers have their own inherent limitations, e.g., low computational and memory efficiency [7], the lack of convolution-like inductive biases [27], and so on. Consequently, models that interleave or combine CNNs and Transformers in a ''hybrid'' way have produced successful results [27] in a variety of computer vision tasks including object detection, image classification, and semantic segmentation, such as ConViT [28], TransCNN [29], ViT [30], CvT [7], and so on. However, there are very few studies examining the use of such hybrid methods in the realm of CD. In a recent work [24], a Transformer architecture is applied in conjunction with ResNet to enhance the feature representation while keeping the overall CNN-based feature extraction process in place. In this article, we report a novel Siamese CNN-self-attention combined architecture (termed 3M-CDNet-V2) for change detection using bitemporal high-resolution imagery. As its name implies, 3M-CDNet-V2 is the updated version of 3M-CDNet [11], a lightweight CD network involving only about 3.12 M trainable parameters. 3M-CDNet was designed by reducing the width and depth of the backbone network, thus producing high-resolution feature maps, which facilitates the detection of detailed spatial information (including small changed objects) at acceptable computational cost. In addition, 3M-CDNet incorporates deformable convolutions to promote its geometric transformation modeling ability. It makes a better trade-off between accuracy and inference speed than existing methods.
However, in practical applications, we found that 3M-CDNet has some limitations as well, such as the vanishing-gradient problem, suboptimal improvement in model performance, difficulties in model training, the limited power of deformable convolutions, and so on. We also found that these limitations could be minimized by a series of modifications, such as replacing several deformable convolutions in 3M-CDNet with Multi-Head Self-Attention (MHSA) layers that implement global self-attention over a 2D feature map [31]. These findings motivated us to develop an updated version of 3M-CDNet by combining CNN and MHSA into a single network.
In addition, classic CNNs or CNN-like ViT networks often involve too many pooling or down-sampling operations, so the spatial details in deep features are lost. To address this issue, the new generation of CNN models usually emphasizes the retention and fusion of both spatial detail information and semantic information, such as IFN [16], SNUNet-CD [17], etc. Moreover, the classic ViT network [30] does not perform down-sampling operations, and it can enrich the high-level semantic information with detailed spatial features. On the other hand, if the training and inference of a model are directly based on high-resolution feature maps, the gap between detailed spatial information and semantics can be bridged, or at least narrowed. 3M-CDNet is such an example: the feature maps fed into the classifier are a quarter of the original image size, and 3M-CDNet-V2 goes even further, using feature maps half the size of the original image.
As mentioned, to obtain better prediction accuracy, there is a trend toward combining the strengths of CNNs and ViTs (MHSA) [7]. There are three typical ways to do this: (1) Most recent ViTs, such as CvT [7] and ConViT [28], adopt a hierarchical architecture, as widely used in CNNs such as AlexNet and ResNet. These models apply global MHSA and its variants to high-resolution tokens, thus incurring heavy computational cost. (2) To improve efficiency, one can compute MHSA within a local/windowed region as CNNs do, for example the Swin Transformer [32], Cross-Swin Transformer [33], and TransCNN [29]. (3) Using a CNN-based module (i.e., a Token Pyramid Module), the TopFormer model proposed in [34] takes tokens from various scales as input to produce scale-aware semantic features. 3M-CDNet-V2 takes a different approach: it directly replaces the spatial deformable convolutions with MHSA in the final layers of a CNN backbone, with the aim of obtaining rich semantics and a large receptive field.
In this article, we make the following contributions: (1) Inspired by the recent success of the lightweight CD network 3M-CDNet, we made a series of modifications to it, which bring a significant performance boost.
(2) We replace the spatial deformable convolutions with MHSA in the final layers of 3M-CDNet, so as to learn long-range dependencies in the bitemporal images.
(3) A new transfer learning strategy was applied to facilitate the application of 3M-CDNet-V2 on small datasets.
The rest of this article is organized as follows. Section II describes the related work on 3M-CDNet and gives the details of our proposed architecture. The comparative experimental results and ablation studies are reported in detail in Section III. The discussion is given in Section IV, and the conclusion is drawn in Section V.

A. A BRIEF INTRODUCTION OF 3M-CDNET
As shown in Fig. 1, 3M-CDNet mainly consists of four components: Input Layer, Layer 1, Layer 2, and Classifier. The former three are used to extract deep features from the input image I^(1,2) ∈ R^(6×H×W), which is a concatenation of the pre-change image and the post-change image along the channel dimension. Therein, as shown in Fig. 2, Layer 1 and Layer 2 are ''modulated deformable convolution'' (MDConv)-based backbone networks [35]. The last component, a pixel-wise classifier, is used to classify the extracted features into two classes in a change probability map [3]. The input of the model is a six-band fused RGB image, and the output is a binary change map CM ∈ R^(1×H×W), where pixels are either unchanged or changed. 3M-CDNet is a lightweight CD network for remote sensing images, involving only about 3.12 M trainable parameters.
Using 3M-CDNet, change detection can be implemented on high-resolution feature maps (at least 1/4 of the original image size). Reducing the width and depth of the backbone network also decreases the number of model parameters, thus creating high-resolution yet lightweight deep features. At the same time, however, the receptive field of these deep features is limited by the backbone network's depth, and the completeness of the contextual semantics in deep features cannot be preserved [36], leading to weak feature representation. Fortunately, MDConv, a powerful mechanism for attending to flexible spatial locations conditioned on the input data [35], is designed to capture deformable context from objects with various shapes and scales, thus providing a feasible solution to alleviate these problems. In other words, learning a deformable receptive field for the convolution filters has proved effective in selectively attending to more informative regions in the feature maps [37]. (Fig. 2 caption, adapted from [11]: convolution layers are denoted by ''number of kernels of each filter, kernel size, number of filters'', e.g., ''128, Conv1 × 1, 64''. Conv and MDConv indicate the 2-D convolution layer and the modulated deformable convolution layer, respectively.)

1) CHANGE THE EARLY-FUSION FRAMEWORK TO LATE FUSION
As shown in Fig. 1, 3M-CDNet adopts the early-fusion strategy, treating change detection exclusively as a semantic segmentation problem. This strategy, however, has been proven to be sensitive to noisy conditions such as geometric distortion and different viewing angles, and often leads to reduced accuracy [1]. Accordingly, as displayed in Fig. 3a, we modified the Input Layer to receive the bitemporal images separately instead of two concatenated images as in Fig. 1. To be specific, we changed the original early-fusion framework to late fusion by constructing a Siamese network with shared weights, in which the resulting features are extracted from the fused results (i.e., subtraction) of two ''independent'' branches.
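As an illustrative sketch (not the paper's implementation), the late-fusion scheme can be expressed with a toy shared-weight encoder; the `encoder` function here is a hypothetical stand-in for the Input Layer branch:

```python
import numpy as np

def encoder(x, w):
    """Toy shared-weight feature extractor (stand-in for one Siamese branch)."""
    return np.maximum(0.0, x * w)  # ReLU(x * w); the weights w are shared

def late_fusion(img_t1, img_t2, w):
    """Extract features from each temporal image independently with the SAME
    weights, then fuse the two feature maps by (absolute) subtraction."""
    f1 = encoder(img_t1, w)   # branch 1, time t1
    f2 = encoder(img_t2, w)   # branch 2, time t2 (identical weights)
    return np.abs(f1 - f2)    # fused difference features fed to later layers
```

Because the weights are shared, identical bitemporal inputs yield an all-zero difference map, which is the property the late-fusion design relies on.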

2) CHANGE ''CONV'' TO ''MBCONV'' IN THE INPUT LAYER
We used a customized CNN module to extract features from image patches. As illustrated in Fig. 3b, from 3M-CDNet to 3M-CDNet-V2, two 3 × 3 convolutions in the Input Layer were replaced by ''MBConv'', the mobile inverted bottleneck convolution [36]. MBConv is widely used in EfficientNet [36], MobileNet [38], and many other lightweight backbone networks. Compared with its original version, here we added a residual connection and removed the SE (Squeeze-and-Excitation) attention layer, reducing parameters while improving accuracy. An important aspect of MBConv is that it creates an inverted bottleneck, i.e., its hidden dimension is 4 times wider than the input dimension (from C to 4C). It is reported that wider networks tend to capture more fine-grained features and are easier to train [36].
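The inverted-bottleneck arithmetic can be checked with a simple parameter count. This sketch assumes a bias-free block composed of a 1 × 1 expansion, a depthwise 3 × 3, and a 1 × 1 projection (the common MBConv structure without the SE layer); the exact layer layout of the paper's variant may differ:

```python
def mbconv_params(c, expand=4, k=3):
    """Weight count of an inverted-bottleneck block (bias-free, no SE layer):
    1x1 expand (C -> 4C), kxk depthwise on 4C channels, 1x1 project (4C -> C)."""
    hidden = expand * c
    return c * hidden + k * k * hidden + hidden * c

def plain_conv_params(c, k=3):
    """A single kxk full convolution keeping C channels."""
    return k * k * c * c

# At C = 64 the inverted bottleneck is three layers deep yet still has fewer
# weights than one plain 3x3 convolution, because the wide middle is depthwise.
```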

3) ADD ''ASPP'' TO THE INPUT LAYER
We also added the ASPP (Atrous Spatial Pyramid Pooling) module [39] to the Input Layer. ASPP is introduced to obtain several multi-scale receptive fields, which may significantly improve the network's ability to handle both large and small changes [39]. The introduction of ASPP also deepens the Input Layer without reducing the size of the feature maps. The benefit of deeper networks is that they can capture richer, more complex [36], and highly discriminative features, and generalize well to new datasets. Via dense skip connections, the output features of the Input Layer (X_0) are then fed into subsequent modules to determine the change.
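To see why parallel atrous rates yield multi-scale receptive fields, the effective kernel size of a dilated convolution can be computed directly. The rates below are the common DeepLab defaults, assumed here only for illustration; the paper's exact rates follow [39]:

```python
def effective_kernel(k, d):
    """Effective receptive-field size of a kxk atrous convolution with
    dilation rate d: k_eff = k + (k - 1) * (d - 1)."""
    return k + (k - 1) * (d - 1)

# Hypothetical ASPP rates: four parallel views of the SAME feature map,
# each covering a different spatial extent at identical parameter cost.
rates = [1, 6, 12, 18]
fields = [effective_kernel(3, d) for d in rates]  # 3x3 kernels at each rate
```

A rate-18 branch thus behaves like a 37 × 37 kernel while keeping only 3 × 3 = 9 weights, which is how ASPP widens the receptive field without shrinking the feature map.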

4) THE WIDE USE OF SKIP CONNECTION
Much evidence demonstrates that both high-level semantics and low-level detail information are important in change detection [1], so, as shown in Fig. 3a, skip connections are widely used in our architecture to fuse feature representations at different levels. For example, the input of Layer 2 is a fused representation obtained by concatenating the high-level X_1 and low-level X_0 along the channel axis, and the input of Classifier 1 is obtained by concatenating X_0, X_1, and X_2. In the original version of 3M-CDNet, there is only one skip connection, sending the fused X_1 and X_2 representation to the Classifier, and the authors claimed that additional connections seemed unhelpful for improving model performance [11]. However, in our case, the opposite strategy, which takes full advantage of the improved feature representation X_0 via dense skip connections, proved to work better.

5) THE WIDE USE OF DEEP SUPERVISION
Multistage prediction and deep supervision can improve the learning ability of the feature extractors and thus help derive highly discriminative information [5]. As shown in Fig. 3a, 3M-CDNet-V2 introduces deep supervision between each ''Layer'' pair, and four deep supervision branches are used in total. The implementation is as follows: taking the Input Layer as an example, we first calculate X_diff = |X_0^1 − X_0^2|, where the superscripts 1 and 2 represent times t_1 and t_2, respectively; then, a convolution layer reduces the dimension of X_diff to 1 (i.e., the number of classes); subsequently, X_diff is upsampled to the original image size; finally, the loss between the predicted confidence map and the ground truth is calculated during training. In this way, the intermediate layers are effectively trained and their weights can be finely updated [16], thus alleviating vanishing gradients and improving model performance. The deep supervision strategy facilitates network convergence during the training phase, but it incurs more computation and memory cost than single-head prediction [11], which is why the original 3M-CDNet did not apply it. However, we will show that deep supervision is crucial in 3M-CDNet-V2.
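A minimal NumPy sketch of one such auxiliary branch, assuming a bias-free 1 × 1 convolution for the channel reduction and nearest-neighbour upsampling standing in for the model's interpolation:

```python
import numpy as np

def deep_supervision_head(x0_t1, x0_t2, w, scale):
    """One auxiliary prediction branch: difference -> channel reduction -> upsample.
    x0_t1, x0_t2: (C, H, W) features of the same layer at times t1 and t2.
    w: (C,) weights of a 1x1 convolution reducing C channels to one class map.
    scale: upsampling factor back toward the original image size."""
    diff = np.abs(x0_t1 - x0_t2)                      # X_diff = |X_0^1 - X_0^2|
    logits = np.tensordot(w, diff, axes=([0], [0]))   # (H, W) single-class map
    return np.kron(logits, np.ones((scale, scale)))   # nearest-neighbour upsample
```

The loss against the full-resolution ground truth would then be computed on this upsampled map, giving the intermediate layer its own gradient signal.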

6) THE INTRODUCTION OF MHSA IN LAYER 2
This idea is borrowed from [31]. The authors of [31] have shown that by simply replacing the spatial convolutions with global multi-head self-attention (MHSA) in the final three bottleneck blocks of a ResNet, with no other changes, their approach improves significantly upon the baselines on object detection and instance segmentation while also reducing the parameters, with minimal overhead in latency. In a similar fashion, as shown in Fig. 3c, we replaced the spatial convolutions with MHSA in the final three bottleneck blocks of Layer 2, which have the minimum feature-map size of H/8 × W/8. By leveraging MHSA, Layer 2's MDConvBottleneck has larger receptive fields and can use attention to model long-range dependencies in the data. In the original 3M-CDNet, the authors attempted to use MDConv, rather than down-sampling operations, to enlarge the receptive field of the deep features. But it still has difficulty in producing high-quality classifications because it lacks the element-relation modeling mechanism that is key to the success of image-processing tasks [37]. The network architecture of the MHSA used in Fig. 3c is shown in Fig. 4. Note that, in order to make the self-attention position-aware, a relative-distance-aware position-encoding module [40], which is better suited for vision tasks [41], is incorporated in MHSA.
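For illustration, a single-head, NumPy-only version of this attention can be written as follows. This is a simplification: the real module uses four heads and splits the position encodings into R_h and R_w, whereas here a single combined encoding r is assumed:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mhsa_2d(x, wq, wk, wv, r):
    """Single-head self-attention over a flattened 2D feature map.
    x: (N, d) tokens with N = H*W; wq, wk, wv: (d, d) projections;
    r: (N, d) relative position encodings (content-position term of [31])."""
    q, k, v = x @ wq, x @ wk, x @ wv
    # Attention logits: qk^T (content-content) + qr^T (content-position).
    logits = q @ k.T + q @ r.T
    return softmax(logits / np.sqrt(x.shape[1])) @ v
```

Every token attends to every other token, so the receptive field of this layer is the entire H/8 × W/8 feature map in a single step.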

7) DIVIDE CLASSIFIER INTO CLASSIFIER 1 AND CLASSIFIER 2
As illustrated in Fig. 3d, we divided the original Classifier module in 3M-CDNet (Fig. 1) into Classifier 1 and Classifier 2. Therein, Classifier 1 is designed to aggregate and enhance the discriminative information contained in the feature maps (X_0, X_1, and X_2) captured by the previous convolution/MHSA layers. Classifier 2 is used to generate the predicted change map based on the subtraction of X_3^1 and X_3^2, the bitemporal outputs of Classifier 1. Compared with the original Classifier shown in Fig. 1, this division is clearly more compatible with the proposed Siamese architecture.

8) INTRODUCTION OF LOW-LEVEL FEATURES TO THE CLASSIFIER
(Fig. 4 caption, adapted from [31]: four heads are used in MHSA but are not shown in the figure for simplicity. MHSA is performed on a 2D feature map (height, width, and dimension: H × W × d) with split relative-position encodings R_h and R_w for height and width, respectively. The attention logits are qk^T + qr^T, where q, k, and r represent the query, key, and position encodings, respectively. For details of the relative position encodings, the reader is referred to [37].)
It is believed that the high-level feature representations in CNNs are accurate in semantics but coarse in location; on the contrary, low-level features contain fine details but lack semantic information [1]. However, the output feature representations (X_0, X_1, and X_2) of each block in 3M-CDNet all have the same size, H/4 × W/4 (a quarter the size of the input image), and the authors in [11] attempted to make these features (especially X_2) accurate in both location and semantics. Due to the limited resolution of the feature representations used for binary CD, detecting detailed spatial information such as very small changes and narrow shaped objects remains problematic, and the accuracy of boundary detection may suffer.
Accordingly, as illustrated in Fig. 5, we added a new skip connection between τ and X_3, both of which have the size H/2 × W/2. Therein, τ is the output feature representation obtained after executing the first three layers of the Input Layer block, and it contains much finer location information than X_0, while X_3 contains accurate semantic information. Before executing the Classifier 2 block, we concatenate τ and X_3 to form a new feature representation of size H/2 × W/2 × (256+128). Through these steps, the change detection capability of our V2 model is expected to be further improved.
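The fusion step amounts to a channel-wise concatenation. This sketch assumes X_3 carries 256 channels and τ carries 128, consistent with the (256+128) figure above, and uses channel-first (C, H, W) layout:

```python
import numpy as np

H, W = 256, 256                           # original input size (illustrative)
tau = np.zeros((128, H // 2, W // 2))     # low-level features from the first
                                          # three layers of the Input Layer
x3 = np.zeros((256, H // 2, W // 2))      # semantic output X_3 of Classifier 1

# Concatenate along the channel axis: an H/2 x W/2 map with 256+128 channels,
# combining fine location detail (tau) with accurate semantics (X_3).
fused = np.concatenate([x3, tau], axis=0)
```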

9) REMOVE ''SIGMOID + BINARIZATION'' OPERATION FROM CLASSIFIER
Because the sigmoid/softmax function, which maps any real value to the range [0, 1], is already embedded in our loss functions (see Section II-C), we removed the ''sigmoid + Binarization'' operation shown in Fig. 1 from Classifier 2 in order to avoid the vanishing gradients caused by repetitive normalization [5]. On the other hand, we reuse the sigmoid/softmax function when calculating the evaluation metrics in order to scale all predictions between 0 and 1. However, as the sigmoid/softmax module no longer receives gradient-descent updates, we cannot use the commonly used threshold of 0.50 [5] to perform the binary classification, and the threshold value must be adjusted manually.
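A minimal sketch of the evaluation-time binarization, with the manually tuned threshold exposed as a parameter (0.65 is the value arrived at later in Section III-B):

```python
import numpy as np

def binarize(logits, threshold=0.65):
    """Map raw classifier logits to a binary change map.
    The sigmoid is applied only at evaluation time (it is folded into the
    training loss), so the decision threshold is tuned manually rather than
    fixed at the conventional 0.50."""
    probs = 1.0 / (1.0 + np.exp(-logits))   # squash logits into [0, 1]
    return (probs >= threshold).astype(np.uint8)
```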

C. THE LOSS FUNCTION
In practical applications of change detection, the number of unchanged pixels is often much larger than that of changed ones. Following the suggestion of [15], one way to alleviate the impact of sample imbalance is to introduce a hybrid and weighted loss function [5]. In this article, we consider the combination of the focal loss [17] and Dice loss [42]. The hybrid loss function L can be formulated as

L = L_focal + L_dice

Therein, the Dice loss (L_dice) attaches similar importance to false negatives and false positives, and it is effectively immune to the sample-imbalance issue [42].
L_dice is formulated as follows:

L_dice = 1 − (2 Σ_i y_i t_i + ε) / (Σ_i y_i + Σ_i t_i + ε)

where the smoothing factor ε is an extremely small number (here ε is set to 1e-7); 0 ≤ y_i ≤ 1 is the predicted value processed by the sigmoid/softmax activation function; and t_i is the ground-truth value, either 0 or 1. The focal loss is defined as:

L_focal = −α_t (1 − p_t)^γ log(p_t)

where p_t is the probability the model predicts for the ground-truth class. Focal loss adds (1 − p_t)^γ as a modulating term to the standard cross-entropy loss so that it focuses learning on hard examples and prevents the large number of easy negatives from overwhelming the detector during training. When p_t > 0.5, setting γ > 0 reduces the relative loss for well-classified samples, thus putting more focus on hard, misclassified samples. Moreover, L_focal gives a high weight (α_t) to the rare class and a small weight to the dominating or common class. Here we empirically set γ = 2 and α_t = 0.25 ∼ 0.45.
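Both losses are straightforward to express in NumPy; the sketch below follows the definitions above (soft Dice with smoothing factor ε, and binary focal loss with modulating term (1 − p_t)^γ and class weight α_t), not the paper's exact implementation:

```python
import numpy as np

def dice_loss(y, t, eps=1e-7):
    """Soft Dice loss; y = sigmoid probabilities, t = {0, 1} ground truth."""
    inter = (y * t).sum()
    return 1.0 - (2.0 * inter + eps) / (y.sum() + t.sum() + eps)

def focal_loss(y, t, gamma=2.0, alpha=0.25):
    """Binary focal loss; p_t is the predicted probability of the true class,
    alpha_t weights the rare (changed) class against the common one."""
    p_t = np.where(t == 1, y, 1.0 - y)
    alpha_t = np.where(t == 1, alpha, 1.0 - alpha)
    return (-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)).mean()
```

The hybrid loss would then simply sum the two terms; note how the modulating term (1 − p_t)^γ drives the focal loss of well-classified pixels toward zero.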

III. EXPERIMENTS

A. RELEVANT DATA SETS
These three datasets were collected from several platforms, sensors, and areas/cities, focusing on different change-type scenarios, so they are well suited to testing the applicability, robustness, and generalization ability of 3M-CDNet-V2. In addition, (1): Google Earth images and Gaofen-2 images are the preferred data sources in engineering applications because of their low cost and wide availability in the market.
(2): There are many erroneously annotated labels in PIESAT-CD and CD_Data_GZ, but this situation is particularly common in engineering applications. From the application-oriented perspective, these are the reasons why CDD, PIESAT-CD, and CD_Data_GZ were selected to train and verify our model.

B. EVALUATION METRICS
The Classifier 2 module shown in Fig. 3d is a single-class object classifier, and it produces only one change map for each input image pair (the number of classes is 1). After the change maps are normalized between 0 and 1 using the sigmoid/softmax function, we need a binarization threshold defining whether a given pixel is changed or not. With the aim of maximizing Precision and Recall at the same time and keeping a balance between them, this threshold was empirically set to approximately 0.65 by trial and error.
By comparing the binarized change map with the ground truth, four metrics were used to validate the effectiveness and accuracy of the proposed methods: Precision, Recall, F1-Score, and IOU (Intersection over Union). The formulas are given below:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
IOU = TP / (TP + FP + FN)

where TP, FP, and FN denote the total pixel numbers of true positives, false positives, and false negatives [5], respectively.
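These four metrics follow directly from the pixel counts; a minimal sketch:

```python
def cd_metrics(tp, fp, fn):
    """Precision, Recall, F1, and IOU from pixel counts of true positives,
    false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou
```

Note that IOU is always the strictest of the four: it penalizes FP and FN simultaneously in a single denominator.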

C. IMPLEMENTATION DETAILS
We implemented 3M-CDNet-V2 as well as the relevant comparative models using the open-source PyTorch framework. We conducted the experiments on a single NVIDIA Tesla V100 with 32 GB of GPU memory and trained for 100 epochs to make the models converge. The relevant models were trained from scratch, and the batch size was set to 10. The AdamW optimizer was applied to train our models. The initial learning rate was set to 1e-3, and we decayed the learning rate of each parameter group by gamma (= 0.770) every ten epochs. Validation was performed after each training epoch, and the best model on the validation set was saved in .pth format and used for evaluation on the test set. We also applied online data augmentation to the input image patches, including flipping, normalization, rotation, Gaussian blur, histogram matching, shuffling, etc.
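The step schedule implies a closed-form learning rate per epoch; a one-line sketch, equivalent in effect to PyTorch's `StepLR` with `step_size=10` and `gamma=0.770`:

```python
def step_lr(epoch, base_lr=1e-3, gamma=0.770, step=10):
    """Learning rate at a given epoch under the step-decay schedule:
    multiply the base rate by gamma once per completed block of `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```

With these settings, the learning rate at epoch 90 (the last decay before the 100-epoch budget) is roughly 1e-3 × 0.770^9 ≈ 9.5e-5, a gentle two-orders-of-magnitude-free decay.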
In addition, as the CD_Data_GZ dataset is too small to fit the experimental models, we introduced transfer learning here. To be specific, before training on CD_Data_GZ, we initialized 3M-CDNet, 3M-CDNet-V2, and several other compared models using the weights pretrained on the PIESAT-CD dataset (shown in Table 1), instead of initializing them randomly. In theory, this operation helps the neural networks reach better and more robust classification performance [5] on CD_Data_GZ and reduces the risk of over-fitting.

D. COMPARISON AND ANALYSIS

1) THE PERFORMANCE OF THE PROPOSED MODEL
In this section, we compare the performance of our 3M-CDNet-V2 with several existing SOTA methods as follows: (1): FC-Siam-Diff [10] is a feature-difference method that extracts multi-level features of the bi-temporal images with a Siamese UNet architecture; their difference is used to detect changes.
(3): IFN [16] fuses multi-scale deep representations of the bi-temporal images with image-difference representations by means of attention modules for change-map reconstruction. It uses pretrained VGG16 as the backbone network.
(4): ChangeSTAR [43] is a single-temporal supervised change detection algorithm. To ensure fairness, in this paper we still use bitemporal image pairs as its input. Note that ChangeSTAR uses a pretrained deep network (ResNet-101) as the encoder. During training, we set the number of iterations to 5600, with the batch size set to 4 for input images of size 512 × 512 and to 16 for images of size 256 × 256; the SGD optimizer with momentum = 0.9 and weight_decay = 0.0001 is used.
(6): SiUNet3+-CD [5] is a siamese UNet3+-based CD network architecture. Note that it uses the GN (Group Normalization) layer [5] to normalize the inputs to a hidden layer, so that the batch size can be set as small as possible in conditions of insufficient computational resources.
(7): BIT [24] is a ResNet-Transformer combined architecture for CD. It applies a transformer encoder-decoder network to enhance the context-information of ResNet features via semantic tokens followed by feature differencing to obtain the change map. Note that in [24], the number of training epochs is set as 200 to achieve a relatively good performance, whereas for a fair comparison, it is set as 100 in our experiments.
(8): ChangeFormer [25] is a pure Transformer-based Siamese network for change detection. Specifically, it uses a hierarchical Transformer encoder in a Siamese architecture with a simple MLP decoder. ChangeFormer has a pretrained model which can be downloaded from http://www.dropbox.com/s/undtrlxiz7bkag5/pretrained_changeformer.pt?dl=0. Note that, to make a fair comparison, we set the training epochs of ChangeFormer to 100, rather than 200 as recommended by the authors in [25].
To enable a fair comparison, whenever possible, we adopt the reported performance from the corresponding publications. Otherwise, if not specified, we re-implement the algorithms based on public codes or using our own Python codes with the parameter settings consistent with the relevant references. The same online data-augmentation strategy as above-mentioned was applied to all the compared methods.
As can be seen from Table 1: for the CDD dataset, SNUNet-CD achieves the best CD performance. 3M-CDNet-V2 achieves the second-best performance in terms of the F1, IOU, and Recall metrics compared with the other models, including those with pretrained backbone networks, e.g., STANet, IFN, BIT, ChangeSTAR, and ChangeFormer. The BIT model, with a pretrained ResNet as its backbone, achieves the third-best performance. Note that, according to [24], BIT achieves better results on CDD than those reported in Table 1; this is because in [24] the BIT model was trained for 200 epochs, while in Table 1 training was stopped after 100 epochs to ensure fairness. 3M-CDNet gives the worst performance, although its batch size was set to 20 during training. Such favorable accuracy is not only significant in the quantitative assessment reported in Table 1, but is also visually noticeable in the results shown in Fig. 6. To be specific, our proposed 3M-CDNet-V2 model produces fewer false positives and negatives, and it successfully returns the changed areas with relatively accurate shapes/boundaries and high internal object compactness, although some important details are still missing from the predicted maps. 3M-CDNet-V2 also has a certain capability to trace small changes (e.g., row 6 in Fig. 6) and narrow shaped objects (e.g., rows 2, 3, and 5 in Fig. 6). Incidentally, although SiUNet3+-CD has a relatively low F1-score and recall, it is able to identify each small changed object with a clear and separated boundary, showing slightly better visual performance than ours. However, its computational cost is very high: its FLOPs (floating-point operations) reach 216.72 G [5] for a bitemporal input image size of 2 × 256 × 256 × 3, while the FLOPs of our V2 model are only 32.76 G, achieving a good balance between computational efficiency and prediction accuracy.
For PIESAT-CD, 3M-CDNet-V2 still attains the best scores (F1 and IOU) compared with the other nine architectures, although no pretrained parameters were involved in the training process. IFN and 3M-CDNet give the worst performance. Although many of the PIESAT-CD image pairs are not accurately labeled, the precision achieved by 3M-CDNet-V2 is still above the 85.0% level commonly recommended for industrial applications. As shown in Fig. 7, the change masks generated by 3M-CDNet-V2 preserve the actual shape of changed objects with complete and separated boundaries, even when there are misregistration errors, clumped masks, and missing objects in the labels. On the other hand, the compared methods, including the ones pretrained on the ImageNet dataset (e.g., STANet, ChangeFormer, ChangeSTAR, and BIT), often produce false alarms, fragmented boundaries, and clumped masks. SiUNet3+-CD and SNUNet-CD, which perform very well on the CDD dataset, did not deliver a satisfactory performance on PIESAT-CD, suggesting limited generalization ability/robustness. Another interesting observation is that although ChangeSTAR has prediction accuracy second only to 3M-CDNet-V2, its visual performance is sub-optimal. The case of BIT is similar, indicating it is less robust to pseudo-changes and noise.
For CD_Data_GZ, IFN and STANet obtain the worst performance. 3M-CDNet-V2 achieves the third-best CD performance in terms of precision, recall, F1-score, and IOU, behind only ChangeSTAR and SNUNet. However, as illustrated in Fig. 8, the visual performance of ChangeSTAR and SNUNet is quite poor, as both are vulnerable to pseudo-changes and noise.
Their relatively high recall values do not indicate successful detection of the targets but instead lead to many false positives. In addition, there are quite a few missing objects in the ground-truth images, and ChangeSTAR and SNUNet fail to identify them. In contrast, the visual performance of 3M-CDNet-V2 is improved significantly. As highlighted in red, 3M-CDNet-V2 captures much finer details than the other SOTA methods, including the small objects missed in the ground-truth images. Evidently, the proposed method achieves good generalization capacity on small datasets by initializing the model parameters with parameters pretrained on PIESAT-CD, which is greatly helpful in engineering applications.
The above results demonstrate that 3M-CDNet-V2 has excellent generalization ability and a strong ability to counter pseudo-changes and noise.

2) THE EFFICIENCY EVALUATION
(2): ChangeSTAR, IFN, ChangeFormer, and SiUNet3+-CD have the largest numbers of trainable parameters, and among them, IFN and SiUNet3+-CD have the largest numbers of FLOPs: 82.26 G and 216.72 G, respectively. Although some of them provide acceptable CD performance, their expensive computational costs limit their real-world applications.
(3): FC-Siam-Diff, 3M-CDNet, and SNUNet-CD have the fewest trainable parameters, but they offer significantly lower CD accuracies in our experiments. Worse still, when the input image size is 256 × 256, the FLOPs of 3M-CDNet and SNUNet-CD reach 23.69 G and 54.77 G, respectively. Thus, these models cannot provide a good balance between computational complexity and performance.
(4): BIT, STANet, and 3M-CDNet-V2 have medium parameter counts, and their FLOPs are 10.59 G, 12.98 G, and 37.63 G, respectively. Referring to Table 1, we conclude that BIT achieves an optimal trade-off between efficiency and accuracy, and the proposed 3M-CDNet-V2 achieves the second- or third-best trade-off. Evidently, it is the MHSA module that gives 3M-CDNet-V2 slightly better CD results while moderately increasing the computational complexity.
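FLOPs figures like those quoted above are typically accumulated layer by layer. For a standard 2D convolution, a commonly used estimate (counting one multiply-accumulate as 2 FLOPs, ignoring bias) can be sketched as follows; the layer shape here is illustrative and does not correspond to any of the compared models:

```python
# Sketch: FLOPs estimate for a standard 2D convolution layer.
# Convention: one multiply-accumulate = 2 FLOPs; bias terms ignored.

def conv2d_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """FLOPs ≈ 2 * Cin * K^2 * Cout * Hout * Wout."""
    return 2 * c_in * k * k * c_out * h_out * w_out

# Illustrative 3x3 conv, 64 -> 64 channels, on a 256 x 256 feature map.
flops = conv2d_flops(64, 64, 3, 256, 256)
print(f"{flops / 1e9:.2f} GFLOPs")  # prints "4.83 GFLOPs"
```

Summing such per-layer estimates over a network explains why architectures that keep high-resolution feature maps (large Hout × Wout) tend to have large FLOPs even with modest parameter counts.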
We also checked the training speed (time consumption) of the different models, which is in general consistent with the FLOPs in Table 1. For example, with batch size = 2, it takes 49 s for FC-Siam-Diff to run 200 forward iterations during the training phase; 47 s for ChangeFormer (with a pretrained model); 55 s for 3M-CDNet; 1 min 30 s for SNUNet-CD; 2 min 04 s for our 3M-CDNet-V2; 2 min 12 s for IFN; and 4 min 29 s for SiUNet3+-CD. Although training our model is relatively time-consuming, its inference efficiency is satisfactory: with batch size = 20 and input image size = 256 × 256, it takes only 1.0 s for our model to make one prediction.

3) ABLATION STUDIES: TESTING ON THE PIESAT-CD DATASET
In this section, we study the effect of removing different components to provide insight into what makes the proposed V2 model successful. We make quantitative comparisons on the PIESAT-CD dataset. Specifically, we conduct the experiments in Table 2 to ablate each component of 3M-CDNet-V2 step by step. To ensure a fair comparison, the parameter settings of the different models in Table 2 are kept the same during training; for example, their batch size is fixed at 10.
As shown in Table 2, there is a significant accuracy boost when adding deep supervision modules to 3M-CDNet (''O''). Specifically, the F1 score of ''O+SUP'' is 23.64% higher than that of the baseline model ''O''.
Accordingly, as shown in Fig. 10, the models with these added components learn the features of the changing objects better, with fewer incorrect detections, while partly avoiding the phenomenon of ''clumping'' in the detection areas. 3M-CDNet-V2 gives the best visual performance: complete boundaries, few clumped masks, and fewer false detections. This is because the introduction of ''τ'' guides the network to learn more detailed difference information between the bi-temporal images.
From the above, we can conclude that our modifications push the baseline model to new state-of-the-art records, and trading a certain increase in computational cost for the improved performance is therefore worthwhile.
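The deep-supervision scheme behind ''O+SUP'' amounts to adding auxiliary classifier outputs whose losses are combined with the main loss during training. A minimal sketch of this combination, with made-up loss values and a hypothetical auxiliary weight:

```python
# Sketch: combining a main loss with auxiliary (deep supervision) losses.
# The loss values and the auxiliary weight below are illustrative only.

def deep_supervision_loss(main_loss: float, aux_losses: list, aux_weight: float = 0.4) -> float:
    """Total training loss = main loss + aux_weight * sum of auxiliary losses.

    At inference time the auxiliary heads are discarded; they exist only to
    inject gradient signal into intermediate layers during training.
    """
    return main_loss + aux_weight * sum(aux_losses)

total = deep_supervision_loss(main_loss=0.50, aux_losses=[0.70, 0.65])
# 0.50 + 0.4 * (0.70 + 0.65) = 1.04
```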

IV. DISCUSSION
In our opinion, eight meaningful aspects of the proposed network architecture are worth discussing; they are presented in the following paragraphs.
First, among all the modifications we made to 3M-CDNet, the late fusion framework and deep supervision have proven to contribute most to the performance gain, followed by ''SC'' and MHSA. However, the authors of [11] argue that, without deep supervision and the skip connections from the Input Layer block, 3M-CDNet still outperforms other SOTA approaches by a significant margin in terms of the comprehensive metrics F1 score and IOU. This conclusion is not supported by the outcome of our study: as shown in Tables 1 and 2, 3M-CDNet has the worst performance among the compared methods. As the source code of 3M-CDNet in [11] has not been released to the public, we cannot obtain the specific designs and implementation details of this model from a programming perspective, but one thing is certain: even though 3M-CDNet can achieve upwards of 90% prediction accuracy in CD tasks, further modifications to it are still possible and necessary.
Second, at the center of 3M-CDNet is high-resolution feature extraction and analysis, which enables the network to reduce the loss of information during downsampled feature learning. Relative to 3M-CDNet, our model goes further: it keeps using high-resolution feature maps during training and prediction by incorporating the skip connection between τ and the classifier, which not only further improves the CD accuracy metrics but also improves the visual quality of the predicted results, namely promoting the detection of detailed spatial changes and object boundaries.
Third, one may question why MBConv and ASPP are essential to 3M-CDNet-V2, given their limited performance gains. In fact, before the introduction of ''SC'', it was Layer I and Layer II that contributed most to the final results, and thus the accuracy gains obtained by MBConv/ASPP were not impressive. However, after the introduction of ''SC'', the features extracted by MBConv and ASPP become part of the input of Layer I, Layer II, and Classifier 1, and substantial accuracy gains were obtained. The authors of [11] found that the skip connections from the Input Layer block yielded nearly no improvement in CD performance, but when MBConv + ASPP were added, the skip connections began to show their effect. This explains why MBConv and ASPP are essential to 3M-CDNet-V2.
Fourth, MDConv is an efficient mechanism for enhancing the geometric transformation modeling ability [11] of deep feature representations, and it can also increase their receptive fields. However, as pointed out by [44], MDConv lacks an element-relation modeling mechanism, which is key to the success of a CD network architecture. Thus, in 3M-CDNet-V2, MHSA was incorporated into the backbone network immediately after the MDConvBottleneck module. This strategy brings two main benefits: (1) further enlarging the receptive field of deep features; and (2) modeling long-range dependencies. The authors of [31] observed a significant increase in prediction accuracy by simply replacing the spatial convolutions with MHSA in the final three bottleneck blocks of a ResNet, but we did not: the accuracy gains obtained by MHSA in our model are moderate. In fact, most investigations in the literature [7], [26], [29] show that Transformer-based architectures achieve comparable or slightly better results than their CNN counterparts, so the performance of MHSA in our experiments meets expectations, although its application involves large computational costs. It can be seen from Fig. 10 that the introduction of MHSA greatly improves the visual quality of the CD maps, so it is not optional but essential for achieving high accuracy. In addition, as suggested by [7], a long training schedule is usually necessary for the MHSA weights to learn to focus on the changing objects of interest. Our experiments also demonstrated that when the number of training epochs is increased from 100 to 200, 3M-CDNet-V2 achieves considerable gains (>2.0%) over the baseline model in terms of IOU, F1 score, and precision on the PIESAT-CD dataset. This suggests that further training may help to further improve the model's ability to identify changes, so the potential of our model might be underestimated.
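The element-relation modeling that MHSA contributes can be illustrated with a stripped-down sketch over a flattened feature map. This is not the implementation used in 3M-CDNet-V2: the token count, head count, and the use of identity projections (instead of learned query/key/value weights) are simplifying assumptions for brevity.

```python
import numpy as np

# Sketch: multi-head self-attention over a flattened feature map.
# Identity Q/K/V projections are used here for brevity; a real MHSA
# block applies learned linear projections per head plus an output projection.

def mhsa(x: np.ndarray, num_heads: int) -> np.ndarray:
    """x: (N, C) token matrix; returns (N, C) after per-head self-attention."""
    n, c = x.shape
    d = c // num_heads
    out = np.empty_like(x)
    for h in range(num_heads):
        q = k = v = x[:, h * d:(h + 1) * d]            # (N, d) slice per head
        scores = q @ k.T / np.sqrt(d)                  # (N, N) pairwise relations
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)       # softmax over all tokens
        out[:, h * d:(h + 1) * d] = attn @ v           # global weighted sum
    return out

# A 16x16 deep feature map with 64 channels, flattened to 256 tokens.
tokens = np.random.default_rng(0).normal(size=(256, 64))
y = mhsa(tokens, num_heads=4)
print(y.shape)  # prints "(256, 64)"
```

The (N, N) score matrix is what gives every spatial position a direct path to every other position (the long-range dependency modeling noted above), and also why the cost grows quadratically with the feature-map resolution, which is why MHSA is usually placed only on low-resolution deep features.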
Fifth, the method proposed in this research is very effective for the dynamic monitoring of changing objects at different scales, but it can be further optimized in the following aspects.
(1) As shown in Fig. 3, we use the subtraction operation to fuse the bitemporal feature representations, but in many cases, direct concatenation might be a slightly better choice for feature fusion [45]. (2) It has been demonstrated that our model has good generalization ability and transferability from PIESAT-CD to CD_Data_GZ, but it did not generalize well from CDD to CD_Data_GZ; CD_Data_GZ differs from CDD in terms of sensor type and definitions of change. For engineering applications, it is therefore worthwhile to create a general pretrained model using sufficient training samples [5] covering different sensor types, changing objects, and imaging conditions. (3) One ongoing challenge in using 3M-CDNet-V2 for CD is how to automatically set the threshold for the binary ''change versus non-change'' classification. If the threshold value is large, precision is higher and recall is lower, and vice versa. In this article, the optimal threshold values (around 0.65) were determined through trial and error. In the future, we plan to modify the Classifier 2 module in Fig. 3d into a two-class object classifier, so that the decision can be made automatically by the ''argmax'' function in Python. (4) MDConv is an efficient mechanism for attending to sparse spatial locations, and MHSA has powerful element-relation modeling capacity, so their combination in 3M-CDNet-V2 yielded excellent CD results. However, Xia et al. [37] believed that it is possible and necessary to do both in a single architecture, and put forward a novel deformable self-attention transformer (DAT). DAT provides a flexible scheme enabling the self-attention module to selectively focus on important regions and capture more informative deep features. Reference [44] did similar work in the field of object detection.
In future research, we plan to simplify the 3M-CDNet-V2 architecture by directly replacing the MDConv-MHSA combination with DAT, so as to make the model more lightweight, more robust, easier to use, and cheaper to run.
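Point (3) above, replacing a hand-tuned threshold with a two-class argmax, can be sketched as follows. The logits are randomly generated for illustration; the point of the sketch is that for a two-channel softmax output, argmax over the class axis is equivalent to a fixed 0.5 threshold on the change probability, so no threshold needs to be tuned.

```python
import numpy as np

# Sketch: manual thresholding vs. two-class argmax for the change mask.
# Logits are illustrative; channel 0 = no-change, channel 1 = change.

rng = np.random.default_rng(1)
logits = rng.normal(size=(2, 8, 8))            # (classes, H, W)

# Softmax over the class axis.
e = np.exp(logits - logits.max(axis=0, keepdims=True))
prob = e / e.sum(axis=0, keepdims=True)

mask_threshold = prob[1] > 0.5                 # manual threshold on P(change)
mask_argmax = logits.argmax(axis=0) == 1       # automatic decision via argmax

print(bool((mask_threshold == mask_argmax).all()))  # prints "True"
```

The equivalence holds because P(change) > 0.5 if and only if the change logit exceeds the no-change logit; a tuned threshold such as 0.65 corresponds instead to adding a fixed bias to the no-change logit.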
Sixth, as the core component of Transformers, MHSA is believed to be unsuitable for small datasets [26], so the introduction of transfer learning is necessary. However, unlike the classic transfer strategy, it is better to initialize all the parameters of 3M-CDNet-V2 with weights pretrained on large datasets and then fine-tune them, rather than freezing and only initializing the encoder part. This is an important finding of our study, and it allows an MHSA-based model to be trained on very small datasets such as CD_data_GZ.
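The contrast between the two transfer strategies can be sketched in a framework-agnostic way using plain dicts as stand-ins for model state dicts. All parameter names and values here are hypothetical; in a real framework the same logic would operate on the model's state dict, with fine-tuning applied to every initialized parameter rather than only the encoder.

```python
# Sketch: full-network vs. encoder-only weight transfer, with plain dicts
# standing in for framework state_dicts. Names/values are hypothetical.

pretrained = {"encoder.conv1": 1.0, "encoder.conv2": 2.0, "classifier.fc": 3.0}
model = {"encoder.conv1": 0.0, "encoder.conv2": 0.0, "classifier.fc": 0.0}

def transfer(model: dict, pretrained: dict, encoder_only: bool = False) -> dict:
    """Copy pretrained weights into the model; optionally only encoder keys."""
    loaded = dict(model)                       # keep randomly-initialized values
    for name, w in pretrained.items():
        if encoder_only and not name.startswith("encoder."):
            continue                           # classic strategy skips the head
        if name in loaded:
            loaded[name] = w
    return loaded

full = transfer(model, pretrained)                        # our strategy
partial = transfer(model, pretrained, encoder_only=True)  # classic strategy
print(full["classifier.fc"], partial["classifier.fc"])    # prints "3.0 0.0"
```

Under the full strategy every layer, including the classifier, starts from pretrained values and is then fine-tuned, which is what makes training feasible on very small datasets.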
Seventh, since the F1-score-versus-efficiency trade-off of our model is not better than that of some existing methods such as BIT, it may seem that the only contribution of our model is the detection of narrow and small changes. Strictly speaking, we may need to collect more evidence or conduct more experiments to support this claim against other existing methods. We accept that 3M-CDNet-V2 may not be the optimal solution for change detection, but it does provide competitive or slightly better accuracy metrics, or better visual quality, than the compared methods (including BIT) on three known datasets. This fact confirms the benefits of the modifications we made to 3M-CDNet and, in particular, improves our understanding of the architecture design space for deep supervision, late fusion, MHSA, and τ in RS change detection. In other words, with its improved detection sensitivity, our model is a step forward in designing an effective end-to-end change detection technique for bitemporal high-resolution image analysis, and this is the most important contribution of this article. We also believe that, owing to the incorporation of MHSA and its power in modeling global context [7], the detection sensitivity and accuracy of our model on new datasets or in engineering applications can be maintained. As for computational efficiency, we accept that our model is inefficient relative to some of the compared methods, but this is not a major concern in practical applications, because our study has demonstrated that 3M-CDNet-V2 has superior transferability to downstream tasks under large-scale pretraining. Given the above, the current study makes meaningful contributions to the knowledge of RS change detection.
Finally, although 3M-CDNet-V2 has achieved satisfactory performance, it requires an enormous amount of training data, which costs considerable expense and time. To ease the effort of acquiring high-quality annotated data, many recent studies have focused on unsupervised or self-supervised CD algorithms, for example, style transformation-based spatial-spectral feature learning for unsupervised CD [46], as well as task-related self-supervised learning for CD, i.e., the discriminative adversarial deep neural networks (DADNN) proposed in [47]. The performance of these algorithms is even comparable to that of state-of-the-art supervised or semi-supervised methods. In our case, a fully unsupervised 3M-CDNet-V2 might not work in theory because MHSA is data-hungry, but it is feasible to equip 3M-CDNet-V2 or its variants with a self-supervised preprocessing module such as DADNN. In a follow-up study, we plan to train a similar architecture on different CD datasets to facilitate its engineering application.

V. CONCLUSION
To satisfy the need for rapid mapping of change information in high-resolution RS imagery, an efficient medium-weight network architecture was proposed for change detection. This architecture was developed as an updated version of 3M-CDNet by combining CNN and MHSA in a single network, termed 3M-CDNet-V2.
Using three engineering CD datasets (CDD, PIESAT-CD, and CD_data_GZ) as benchmarks, a series of comparative and ablation experiments were conducted, and our findings showed that: (1) relative to other state-of-the-art algorithms, the proposed algorithm obtained competitive or slightly better CD performance in both visual comparison and quantitative evaluation in terms of F1-score and IOU; (2) by applying a novel transfer learning strategy, 3M-CDNet-V2 can perform well on small datasets.
(3) 3M-CDNet-V2 maintains a suboptimal trade-off between computational complexity/time and performance. (4) The late fusion framework and deep supervision contribute most to the performance gain of 3M-CDNet-V2, followed by the skip connection and MHSA. In particular, the introduction of MHSA reinforces the performance of 3M-CDNet while moderately increasing the computational cost. (5) The skip connection linking τ to the classifier gives 3M-CDNet-V2 better visual performance, so that it can capture detailed spatial changes such as small changes, narrow-shaped objects, and accurate boundaries.