Domain Knowledge-Guided Self-Supervised Change Detection for Remote Sensing Images

As one of the most popular topics in Earth observation with remote sensing images, change detection (CD) is of great practical value in many fields. Although most supervised methods have made great progress by introducing deep learning to the CD field, they are still limited by manually labeled data. In comparison, unsupervised methods do not require manually labeled data, but their CD accuracy is difficult to improve owing to the lack of constraints or guidance during training. To tackle these issues, we propose a novel domain knowledge-guided self-supervised learning approach for unsupervised CD that fuses the domain knowledge of remote sensing indices during training and inference. Furthermore, we calculate cosine similarity to select the high-similarity feature vectors output by the mean teacher and student networks and implement a hard negative sampling strategy, which effectively improves CD performance. Compared with other supervised and unsupervised CD methods, our proposed approach achieves state-of-the-art performance, with a Kap of 53.34% and an F1 of 55.69% on the Onera Satellite Change Detection dataset. Fusing domain knowledge to guide model training and inference yields an improvement of 5.83% in Kap and 5.13% in F1, which further narrows the performance gap between unsupervised and supervised CD.


I. INTRODUCTION
WITH the rapid development of society and the economy, human activities have inevitably caused continuous changes in the ground surface. Detecting such changes using bitemporal remote sensing images that cover the same geographical area is called change detection (CD). CD with remote sensing images has gradually become the main way for humans to discover ground surface changes. This change information is of great practical significance for many applications, including ecological protection, natural resource management, and urban governance and planning [1], [2], [3]. Today, many supervised methods give better results and are more common in the literature. However, supervised methods rely on the quality and quantity of manually labeled data for training models; this need for labeled data is a problem for automated applications in which no such data, or too little data, are available. Therefore, supervised methods may be impractical for some applications. Unsupervised methods differ from supervised ones by not requiring manually labeled data, but in some unsupervised methods, the lack of supervised information as guidance or constraint often leads to poor CD accuracy. To improve the performance of unsupervised methods, the challenge is to automatically extract supervised information from unlabeled data and close the performance gap between the supervised and unsupervised methods. In past decades, conventional unsupervised methods based on algebraic operations directly compared the pixel values of bitemporal images, among which change vector analysis (CVA) is one of the most famous [4]. In addition, image transformation algorithms project the original image features into a specific feature space to better separate changed and unchanged pixels, such as multivariate alteration detection (MAD) [5], iteratively reweighted MAD (IRMAD) [5], iterative slow feature analysis (ISFA) [6], and deep slow feature analysis (DSFA) [7].
Kernel principal component analysis (KPCA) has also been built into an unsupervised deep siamese KPCA convolutional mapping network (KPCA-MNet) for the CD task [8]. These conventional unsupervised CD methods are highly sensitive to noise in the data, which leads to false detections and inaccurate results.
Recently, deep learning has developed powerful feature extraction abilities in the CD field [9]. Most deep-learning-based methods aim to produce a feature space by mapping data into it to detect changes. The superiority of deep learning has motivated us to use it for unsupervised CD in this article. Many deep-learning-based studies have proposed various unsupervised CD methods for different problems without manually labeled data. For example, Dong et al. [10] proposed an unsupervised CD method for synthetic aperture radar (SAR) images using self-attention deep clustering based on octave convolution, achieving high performance. Saha et al. [11] used trained auxiliary classifier generative adversarial networks (ACGAN) as discriminators to effectively model contextual information and handle multispectral images with a large number of bands. Zheng et al. [12] proposed a novel method based on an autoencoder to detect changes between multitemporal images with different resolutions, without image resizing operations in the preprocessing step. Kalinicheva et al. [13] proposed an end-to-end unsupervised approach based on the reconstruction losses of joint autoencoders (RLJA) to detect changes that do not follow common tendencies. However, reconstruction models with autoencoders overemphasize pixels rather than informative feature representations, which makes it challenging to further improve CD accuracy.
Another deep-learning paradigm, self-supervised learning, can autonomously extract supervised information from a large amount of unlabeled data through representation learning, replacing manually labeled data for training models. It is an unsupervised method requiring no manually labeled data, which helps solve the above problems to a great extent [14]. Self-supervised learning is a new form of machine learning and one of the hottest topics in the deep-learning field [15]. Self-supervised learning methods are generally divided into generative and contrastive methods [16]. Generative methods include image restoration [17], [18] and image colorization [19], [20]; they perform representation learning by restoring deliberately damaged images. The model has learned the key features that describe the original image once it can completely restore it. Currently, contrastive self-supervised learning has been used to learn interpretable and meaningful feature representations, which effectively improves the performance of classification and segmentation tasks [15]. In contrastive self-supervised learning, the key challenge is that constructing reasonable positive sample pairs and collecting hard negative sample pairs must remain unsupervised [21], [22]. The basic principle of the contrastive method is to reduce the feature representation distance between different views of the same image and increase the representation distance between views of different images. The distance between sample pairs is measured and constrained to train the network. Therefore, contrastive learning can effectively learn both invariant and distinguishable feature representations, improving model performance [23]. Enlightened by the above studies, we design a novel self-supervised learning network architecture for representation learning to accomplish unsupervised CD.
Deep learning has achieved great success in computer vision, but it suffers some performance degradation when applied to remote sensing. Deep learning is data-driven by nature, and insufficient training data will limit its performance. Fortunately, domain knowledge can be integrated into the learning process to guide training and produce a trustworthy feature representation model [24]. Little research has combined such domain knowledge with deep learning in the CD task. Owing to the continuity of land cover change in time and space, studies have shown that most features of land cover categories can be quantitatively expressed with domain knowledge [25]. In this article, we select four highly generalizable remote sensing indices (RSIs): the normalized difference vegetation index (NDVI), normalized difference water index (NDWI), normalized difference bare surface index (NDBSI) [24], and shadow enhancement index (SEI) [26]. These RSIs have been widely used to detect land use change and can effectively highlight ground object features [27], [28], thereby providing valuable domain knowledge for the CD task. The domain knowledge in the RSIs can be extracted through self-supervised representation learning to guide model training and inference, which effectively improves CD accuracy.
The main contributions of this article are as follows.
1) We propose a novel domain knowledge-guided self-supervised change detection approach (DK-SSCD). Compared with other CD methods, our proposed approach achieves the best performance and closes the gap between unsupervised and supervised CD methods. The mean teacher network is used to implement CD during inference, which improves CD accuracy and reduces drastic fluctuations in accuracy.
2) Inspired by the hard negative sampling strategy, we propose a feature selection method. We calculate cosine similarity to select the high-similarity feature vectors output by the mean teacher and student networks, which further improves CD accuracy.
3) We propose a fusion method for the domain knowledge in the RSIs, which guides model training and inference to improve model performance.
The rest of this article is organized as follows. Section II presents the related works. Section III elaborates on our proposed approach and evaluation metrics. Section IV reports the experimental results of comparisons with other methods and ablation studies. Section V provides a further discussion of domain knowledge and critical parameters. Finally, Section VI concludes the article.

II. RELATED WORK
In this section, we review the relevant works in terms of the aforementioned aspects, including prior knowledge-based deep learning and self-supervised CD.

A. Prior Knowledge-Based Deep Learning
Fully supervised deep-learning-based methods are heavily dependent on manually labeled data, which is also the common case in the CD field using remote sensing images [3]. Making labeled data for the CD task is more labor-intensive and time-consuming than for other tasks and requires help from professionals. To obtain sufficient change samples, annotators must not only consider the changes between the bitemporal images but also select areas with significant changes. Although unsupervised methods do not need manually labeled data, the lack of prior knowledge as guidance or constraint makes it difficult to extract important semantic information, resulting in poor performance. When manually labeled data are insufficient, some studies have integrated prior knowledge into the learning process to guide training and generate a trustworthy derived model [29]. A recent growth of research has introduced prior knowledge into the training process, and the combined data-driven and knowledge-guided methods effectively improve model performance. Li et al. [30] proposed a two-stage model that can learn representations from a large number of unlabeled remote sensing images and geographical knowledge. Martinez et al. [31] proposed a postprocessing method to enforce prior knowledge based on the probabilities computed in the first step. However, these knowledge-integrated methods have very complicated training processes, and the interaction between the results of each stage restricts model performance. Other studies utilized prior knowledge of particular sites within the study area [32], but such knowledge applies only to the study area and does not guarantee generalizability; these methods therefore cannot be extended to other fields or regions. In addition, prior knowledge is usually unavailable in many regions, so it is necessary to retrieve or extract knowledge from another data source. Other researchers selected the released GlobLand30 [33] maps as prior knowledge [34].
However, these data contain some noisy labels carrying task-irrelevant information, which may cause instability during training.
On the other hand, previous articles have proposed weakly and semi-supervised methods that use inaccurate or insufficient labeled data as prior knowledge to improve model performance. Shaunak et al. [35] proposed a stacked autoencoder network to extract the multitemporal polarimetric information of SAR images and then used label aggregation and a supervised multilayer perceptron (MLP) to detect changes. Daudt et al. [36] proposed an iterative learning method that extracts useful information from open vector data with inaccurate and unreliable ground truth semantic information; the coverage area of this open data limits the applicable area of the model. These two methods are weakly supervised and can use unreliable or incomplete labeled data to train models, but one still needs to find or make the special labeled data. Connors et al. [37] constructed a deep generative model to predict class labels without incurring high sample complexity costs, but the method still needs a small set of labeled data. Hedjam et al. [38] proposed a method to generate impostor patch-pair samples for CD by matching the prechange image with other images from the Internet. However, the image textures from the Internet need to be similar to the postchange image, which makes preparing the training dataset complicated and requires precise knowledge about changed and unchanged areas. Li et al. [39] adopted semi-supervised learning techniques in a deep network to use partially labeled data for training. These three methods are semi-supervised and require some labeled data and some unlabeled data for training. The unlabeled data need no manual labeling, but their distribution must be related to that of the labeled data; otherwise, the unlabeled data will wrongly guide model training and weaken model performance [15], [16].
Overall, weakly and semi-supervised methods are not fully unsupervised and still require special data for training. To avoid these problems, we study a succinct unsupervised method based on self-supervised learning for the CD task, effectively integrating domain knowledge into the model to guide training and inference.

B. Self-Supervised Change Detection
In the computer vision field, there is a large body of work on self-supervised learning, focusing on classification and semantic segmentation [21], [40], [41]. A self-supervised approach can train a model with unlabeled data to achieve representation learning, and the trained model is then used to perform downstream tasks. Chen and Bruzzone [1] proposed a self-supervised CD approach based on a multiview contrastive loss [42]. Saha et al. [14] proposed a self-supervised approach for CD in bitemporal optical and SAR images by absorbing deep clustering, augmented views, Siamese networks, and contrastive learning concepts from the recent literature. These methods can be successfully applied to the CD task thanks to the contrastive loss, which can effectively deal with the feature information of paired samples in the Siamese network to separate unchanged and changed information. The key challenge in contrastive self-supervised learning is how to extract, in an unsupervised way, hard negative samples that are difficult to distinguish from an anchor point [43]. The above self-supervised CD approaches neither fuse domain knowledge to guide model training and inference nor consider the effects of hard negative samples. Therefore, these approaches cannot extract features well, and it is difficult to further improve their CD accuracy. In this article, we propose a novel self-supervised CD approach that effectively fuses RSIs during training and inference to integrate domain knowledge into the model, improving model performance. The mean teacher network is used to implement CD during inference, which effectively improves CD accuracy and reduces drastic fluctuations in accuracy. In addition, we propose a feature selection method inspired by the hard negative sampling strategy. The similarity between the feature vectors output by the mean teacher and student networks is calculated with the cosine similarity function during training.
Then, we select the high-similarity feature vectors to calculate the contrastive loss for representation learning. This method not only reduces computation but also improves CD accuracy.

III. PROPOSED APPROACH
A. Remote Sensing Index
An RSI enhances spectral characteristics by combining different bands of a multispectral image to highlight certain ground features. Today, this highly generalizable technique is mostly used for land cover classification. Different ground features have corresponding indices, which can be used as auxiliary data to improve classification accuracy. Li et al. [24] applied multimodal RSIs as domain knowledge to classify land cover and effectively improved model performance. Madasa et al. [27] used five RSIs to discriminate different land cover categories and monitor land use changes in mining areas. Huang et al. [28] proposed an automatic CD method to monitor newly constructed building areas by calculating the planar and vertical features of multiview remote sensing images with specific RSIs. In addition, Chen et al. [40] showed that multiple data sources are better at improving the performance of contrastive learning than a single source. The NDVI [44] is one of the most commonly used indices to differentiate vegetation from other land cover categories [45]. It is also reported to be highly sensitive to vegetation structure, such as vegetation height and coverage [46], [47], which may benefit the differentiation between forest, grassland, and cropland. Thanks to its high responsiveness to open water, the NDWI [48] is often used to separate wetlands and water [49], [50], [51], [52]. However, the NDWI has difficulty distinguishing between water and shadow. Sun et al. [26] proposed a new building shadow detection index by combining the SEI, NDWI, and NIR band to separate various spectrally similar objects. The SEI enlarges the difference between shadow and water, which helps distinguish them.
In urban areas, land cover before a change generally belongs to vegetation, water, or wetlands, so these RSIs help improve CD accuracy by emphasizing the characteristics of the land use category before the change. The NDBSI performs well in bare surface detection, bare surface being the most common subclass of unused land [51]. Generally, changes caused by human construction activities are more obvious: at some point in the change process, the land cover category usually belongs to bare surface. In other words, bare surface in the image is likely to belong to a changing area.
Therefore, we select four RSIs (Table I) that effectively highlight ground features that frequently appear in the change process. The RSI feature vectors are fed into the model as additional data. The domain knowledge in the RSIs guides our proposed approach during training and inference by capitalizing on the different benefits of the bands, and experiments demonstrate that they effectively improve CD accuracy.
Sentinel-2 images include four bands (B2, B3, B4, and B8) at 10-m spatial resolution, six bands (B5, B6, B7, B8a, B11, and B12) at 20-m spatial resolution, and three bands (B1, B9, and B10) at 60-m spatial resolution. The four 10-m bands are used to calculate the NDVI, NDWI, and NDBSI. Because the spatial resolutions differ when calculating the SEI, we use nearest neighbor resampling to generate 10-m versions of the B1 and B9 bands. The four 10-m bands and the four RSI feature vectors are each concatenated as four-channel data for training and inference.
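As an illustration, two of the indices can be computed from the 10-m bands as follows; NDVI and NDWI use their standard band formulas, while NDBSI and SEI would follow the definitions referenced in Table I (array sizes and the `eps` stabilizer are illustrative assumptions):

```python
import numpy as np

# NDVI = (NIR - Red) / (NIR + Red), using Sentinel-2 bands B8 (NIR) and B4 (Red)
def ndvi(b4_red, b8_nir, eps=1e-8):
    return (b8_nir - b4_red) / (b8_nir + b4_red + eps)

# NDWI = (Green - NIR) / (Green + NIR), using bands B3 (Green) and B8 (NIR)
def ndwi(b3_green, b8_nir, eps=1e-8):
    return (b3_green - b8_nir) / (b3_green + b8_nir + eps)

# Stack RSI feature maps channel-wise, mirroring how B2/B3/B4/B8 are
# concatenated into four-channel data for the image branch.
b3 = np.random.rand(64, 64)
b4 = np.random.rand(64, 64)
b8 = np.random.rand(64, 64)
rsi_stack = np.stack([ndvi(b4, b8), ndwi(b3, b8)], axis=0)  # add NDBSI, SEI similarly
print(rsi_stack.shape)  # (2, 64, 64)
```

Each normalized difference index is bounded in (-1, 1), which makes the RSI channels comparable in scale to each other before concatenation.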

B. Representation Learning
In this section, the main contributions include the following parts. The expansion patch module (EPM) is designed before the backbone network to expand the input patch size and improve CD accuracy. The weights of the mean teacher network are updated by an exponential moving average (EMA) of the student network's weights, rather than shared directly with the student network. In addition, we select the high-similarity feature vectors output by the mean teacher and student networks to calculate the contrastive loss, improving CD accuracy. Fig. 1 shows an overview of the representation learning of the DK-SSCD approach. Initially, the images and the RSI feature vectors carrying domain knowledge are cropped into 8 × 8 patches with a 4 × 4 stride. Then, these patches of the bitemporal images and RSI feature vectors are separately fed into the network architecture. We select the high-similarity feature vectors output by the mean teacher and student networks by calculating cosine similarity and then calculate the contrastive losses. Finally, our model is trained to minimize the weighted sum of the two losses for representation learning.
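The cropping step above can be sketched with PyTorch's `unfold` (patch size 8 and stride 4 as in the text; the tensor sizes and function name are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def extract_patches(img, patch=8, stride=4):
    """Crop a (C, H, W) image into (num_patches, C, patch, patch) tiles."""
    x = img.unsqueeze(0)                                  # (1, C, H, W)
    cols = F.unfold(x, kernel_size=patch, stride=stride)  # (1, C*patch*patch, L)
    n = cols.shape[-1]
    return cols.transpose(1, 2).reshape(n, img.shape[0], patch, patch)

img = torch.randn(4, 32, 32)   # four-channel input (e.g., B2, B3, B4, B8)
patches = extract_patches(img)
print(patches.shape)           # ((32-8)//4 + 1)**2 = 49 patches of 4 x 8 x 8
```

During inference the same operation would be used with a 1 × 1 stride, producing one patch per pixel position.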
1) Network Architecture: Our network consists of an EPM, a backbone, and a projection head. When a small patch is input directly into the backbone network, it is difficult for the backbone to extract valuable features [53], [54]. However, the detection result with a larger patch size is too coarse, which dramatically reduces CD accuracy. To resolve this contradiction, we design the EPM, which consists of two layers of dilated deconvolution with batch normalization (BN), as shown in Fig. 1(c). Dilated deconvolution is a special convolution that focuses on restoring or expanding the feature size [55]. Compared with normal convolution, dilated convolution has a larger receptive field, which helps extract the key information of the patch. Therefore, the EPM helps the backbone network extract important features and improves model performance. For the best model performance, we tested the dilated deconvolution hyperparameters and set the kernel size to 3, stride to 2, padding to 1, output padding to 1, and dilation to 2.
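A sketch of such a module with the stated hyperparameters might look as follows; the channel counts and exact layer arrangement are assumptions, not the authors' published design:

```python
import torch
import torch.nn as nn

class EPM(nn.Module):
    """Two dilated deconvolution (ConvTranspose2d) layers with BN,
    using the reported hyperparameters: kernel 3, stride 2, padding 1,
    output padding 1, dilation 2."""
    def __init__(self, channels=4):
        super().__init__()
        deconv = dict(kernel_size=3, stride=2, padding=1,
                      output_padding=1, dilation=2)
        self.block = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, **deconv),
            nn.BatchNorm2d(channels),
            nn.ConvTranspose2d(channels, channels, **deconv),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return self.block(x)

x = torch.randn(2, 4, 8, 8)   # batch of 8 x 8 four-channel patches
y = EPM()(x)
print(y.shape)  # spatial size grows 8 -> 18 -> 38 per the ConvTranspose2d size formula
```

With these settings, each layer maps a spatial size H to (H-1)*2 - 2 + 2*2 + 1 + 1 = 2H + 2, so an 8 × 8 patch is expanded to 38 × 38 before entering the backbone.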
Previous articles have shown that the receptive field of a fully convolutional network is limited and that global information can be obtained by designing a multilayer stacked network [54], [56], [57]. However, as the number of network layers increases, the feature information is exhausted and the feature attention focuses on local areas. The self-attention in the transformer can mine long-range correlation dependencies, effectively extracting global information [54], [58], and the multihead attention mechanism can project global information into multiple spaces, improving feature extraction capability. In the CD field, transformer-based networks can attend to different regions of the image and integrate information across the entire image, providing good results [59], [60]. In addition, previous articles have shown that vision transformers can significantly improve performance in self-supervised frameworks [54], and Chen et al. [61] proved that the vision transformer outperforms ResNet [53] even without pretraining or strong data augmentation. Caron et al. [62] showed that vision transformers with self-distillation work particularly well for the classification task. Therefore, we use vision transformers as the backbone of the mean teacher and student networks to extract changed and unchanged features of the input data. The projection head is composed of a three-layer MLP with a 2048-dimensional hidden layer, followed by ℓ2 normalization and a linear layer [63] [Fig. 1(d)]. We performed ablation experiments on the projection head and EPM, and this particular design appears to work best for CD. We design the student and mean teacher networks so that they can be trained with unlabeled data for the CD task. Gradient propagation through the mean teacher network is blocked by a stop-gradient operator; gradients flow only through the student network. The weights of the mean teacher network are updated by the EMA of the student network's weights [64].
The update rule is θ_t = λθ_{t−1} + (1 − λ)θ_s, with λ following a cosine schedule from 0.996 to 1 during training.
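As a minimal sketch, the EMA rule above can be applied parameter-wise after each student optimization step (the function and variable names are illustrative):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, lam=0.996):
    """theta_t = lam * theta_{t-1} + (1 - lam) * theta_s, per parameter."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(lam).add_(p_s, alpha=1.0 - lam)

# Tiny demonstration with known weights
teacher = nn.Linear(2, 2, bias=False)
student = nn.Linear(2, 2, bias=False)
nn.init.zeros_(teacher.weight)
nn.init.ones_(student.weight)
ema_update(teacher, student, lam=0.9)
print(teacher.weight.data)  # every entry 0.9*0 + 0.1*1 = 0.1
```

In practice λ would be raised from 0.996 toward 1 on a cosine schedule, so the teacher changes more and more slowly as training progresses.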
2) Calculating Loss: Contrastive learning is one of the most popular feature representation learning methodologies in self-supervised learning. Its key idea is to reduce the feature representation distance between different views of the same image. CD uses bitemporal images of the same area to detect ground surface changes, so the bitemporal images can also be treated as different views of the same image [1]. Therefore, contrastive self-supervised learning can learn both invariant and distinguishable feature representations during representation learning [23], effectively highlighting the change between bitemporal images and accomplishing representation learning for unsupervised CD [14], [65]. Fig. 2 shows the loss calculation process of Fig. 1(b) in the DK-SSCD approach. By calculating cosine similarity, we select the high-similarity feature vectors output by the mean teacher and student networks. We then use the hard negative sampling strategy to address the difficulty of negative sampling [43]. Finally, the positive and negative samples are used to calculate the contrastive loss.
In the classification and object detection fields, many studies simply augment the object region of the original image as positive samples, with the background region as negative samples [15], [40]. This can lead to a problem: negative samples far from the positive samples in feature space are easily separated, but negative samples close in feature space are difficult to distinguish. A random negative sampling method is widely used in natural language processing [66] and computer vision tasks [56], but it may select a large number of negative samples that are very similar to each other, resulting in a lack of diversity in the training samples. An adversarial negative sampling method generates synthetic negative samples using generative adversarial networks [67], but it adds complexity and requires more computational resources during training. In contrast, the basic idea of the hard negative sampling strategy is to treat negative samples whose features are very close to the object, but which do not belong to the object category, as hard negatives. The advantage of this strategy is that it raises the requirements for identifying negative samples, which helps separate the features of different categories and improves model performance [43]. Samples are weighted according to the dot products computed over all negative samples: negatives with high similarity to the positives receive larger weights and those with low similarity smaller ones, so the model can more accurately distinguish hard negative samples. Therefore, we apply the hard negative sampling strategy and use the high-similarity feature vectors to calculate the contrastive loss, which benefits contrastive learning of visual representations [43], [68], [69]. Our experiments show that the high-similarity feature vectors obtained after feature selection help improve CD accuracy.
First, we calculate the cosine similarity between the feature vectors output by the mean teacher and student networks by formula (1), and we select the features with a similarity greater than 0.5. Cosine similarity uses the cosine of the angle between two vectors as the measure of their similarity. Compared with Euclidean distance, cosine similarity pays more attention to the difference in direction between two vectors, which helps highlight the feature distance [42]. The formula can be expressed as

cos(P_t, P_s) = (P_t · P_s) / (‖P_t‖ ‖P_s‖)  (1)

where P_t and P_s are the feature vectors output by the mean teacher and student networks, respectively. After feature selection, a concatenation operation integrates P_s and P_t into P_st. Then, we use formula (2) and the negative sample mask to generate the negative samples (Ng):

neg = exp(P_st * P_st^T / τ)  (2)

where τ is the temperature, set to 0.07 following previous work [42], and * denotes matrix multiplication. According to the rules of matrix multiplication, the neg matrix can be divided into four regions for understanding. As shown in Fig. 2, region 1 of the neg matrix is generated by multiplying P_s and P_s^T; similarly, region 4 is generated by multiplying P_t and P_t^T, and regions 2 and 3 are generated by multiplying P_s and P_t. The blue grid locations in the negative sample mask denote the diagonal locations of the four regions; the values there are generated by summing the products of same-location values of P_s and P_t. The values at the red grid locations are generated by summing the products of different-location values of P_s and P_t. Therefore, we select the values of the neg matrix at the red grid locations of the negative sample mask as the negative samples [1], [42], [43].
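A hedged sketch of this selection and masking step follows; the names, shapes, and the treatment of the positive-pair locations are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def select_and_score(p_t, p_s, thresh=0.5, tau=0.07):
    """Keep teacher/student pairs with cosine similarity > thresh, then
    build the exponentiated similarity matrix of the concatenated features."""
    sim = F.cosine_similarity(p_t, p_s, dim=1)   # per-pair cosine similarity
    keep = sim > thresh
    p_t, p_s = p_t[keep], p_s[keep]
    p_st = torch.cat([p_s, p_t], dim=0)          # concatenated selected features
    p_st = F.normalize(p_st, dim=1)
    neg = torch.exp(p_st @ p_st.T / tau)         # candidate similarity matrix
    n = p_st.shape[0]
    mask = ~torch.eye(n, dtype=torch.bool)       # drop self-similarities
    # A full implementation would also exclude the (s_i, t_i) positive-pair
    # locations (the "blue grid" of Fig. 2) from the negatives.
    return neg, mask

p_t = torch.randn(16, 32)
p_s = p_t + 0.1 * torch.randn(16, 32)            # high-similarity student features
neg, mask = select_and_score(p_t, p_s)
```

The mask plays the role of the negative sample mask described above: only the off-diagonal ("red grid") entries of the similarity matrix survive as negatives.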
Meanwhile, we generate the positive samples with a similarity function that models the feature distance between the outputs of the model heads. Robinson et al. [43] proposed an effective negative sampling strategy to avoid the sampling bias caused by false negative samples that share context information with the anchor. The feature representation in contrastive learning benefits from hard negative samples, that is, samples the embedding believes to be similar to the anchor. We use this method to address the difficulty of negative sampling in the CD task during training, with the class prior probability set to 0.1 and the concentration parameter β = 1.0 as in [43]. Finally, we calculate the contrastive loss for training. The contrastive loss aims to extract feature representations that reflect the feature distance between different samples.
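The published equations for the positive pairs, the reweighted negatives, and the final loss are not reproduced in this text; the following is a sketch of one consistent InfoNCE-style form, following the multiview contrastive loss of [42] and the hard negative reweighting of [43] (the exact published expressions may differ):

```latex
% Positive-pair score between matched student and teacher features
\mathrm{pos}_i = \exp\!\big(\cos(P_{s,i}, P_{t,i}) / \tau\big)

% Hard negatives reweighted by similarity to the anchor
% (concentration parameter \beta = 1.0, class prior 0.1, as in [43])
\omega_{ij} \propto \exp\!\big(\beta \cos(P_i, P_j)\big), \qquad
\mathrm{Ng}_i = \sum_{j \neq i} \omega_{ij}\, \exp\!\big(\cos(P_i, P_j) / \tau\big)

% InfoNCE-style contrastive loss over the N selected feature pairs
\mathcal{L}_{\mathrm{contrast}} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\mathrm{pos}_i}{\mathrm{pos}_i + \mathrm{Ng}_i}
```

Under this form, up-weighting negatives that are similar to the anchor is exactly what makes them "hard": the loss pushes the model hardest on the pairs it currently confuses.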

C. Optimizing Performance With Domain Knowledge
Our proposed domain knowledge fusion method is simple and effective. During representation learning, we use a weighted loss to fuse domain knowledge, which guides model training and effectively improves model performance. During inference, we use weighted regression errors to fuse domain knowledge and generate the change intensity map (CIM), which effectively improves CD accuracy.
1) Guiding Training: The patch pairs of bitemporal RSI feature vectors carrying domain knowledge and of the images are, respectively, fed into the network architecture to calculate the contrastive losses. As shown in Fig. 1(b), we use a weighted loss combining domain knowledge and image feature information to guide representation learning. Our model is trained to minimize the loss defined as

L = λ1 L^I_contrast + λ2 L^DK_contrast  (7)

where L^DK_contrast and L^I_contrast are the contrastive loss values using the RSI feature vectors and the images as input data, respectively, and λ1 and λ2 are the weights of L^I_contrast and L^DK_contrast, respectively.

2) Inference for Change Detection:
The network architecture of the DK-SSCD approach is designed based on the mean Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply. teacher self-distillation, which effectively improves the CD accuracy during inference. The main reason for this improvement is that the self-distillation method can be viewed as implicitly combining the merit of the ensemble and knowledge distillation methods [70]. The ensemble method is one of the oldest and most powerful techniques in practice, also known as model averaging [71]. Simply averaging the output of several independently trained networks on the same dataset can achieve higher performance than the original model. However, the ensemble method has a long test time and a large amount of calculation. To solve this problem, a knowledge distillation method is proposed that only trains another individual model to match the output of the ensemble method [72], [73]. During inference, the bitemporal RSIs feature vectors with domain knowledge and images are cropped into 8 × 8 patch with 1 × 1 stride. As shown in Fig. 3, we fed these patches into the mean teacher network to predict the feature representation of images (FR I ) and domain knowledge (FR DK ). Then, we calculate the regression errors of images (RE I ) and domain knowledge (RE DK ) by formula (8), and weighted sum them as the CIM by formula (9). The regression error of each patch pair is calculated in the corresponding position, which indicates the distances of the bitemporal feature representation. The CIM with domain knowledge can better represent the change probability. Finally, we use the Otsu threshold method to classify changed and unchanged pixels [74], generating the change map with the same size as the input image where FR 1 and FR 2 denotes the bitemporal feature representation. 
RE_DK and RE_I denote the regression errors computed using the RSIs feature vectors with domain knowledge and the images as input data, respectively. To make domain knowledge play the same role during training and inference, we set λ1 and λ2 to the same values as in formula (7).
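The inference step described above can be sketched as follows, assuming the regression error is a squared Euclidean distance over feature channels and using a basic histogram-based Otsu threshold; the function names and λ values are illustrative, not the paper's exact implementation.

```python
import numpy as np

def regression_error(fr1, fr2):
    """Formula (8)-style distance: squared L2 distance between the
    bitemporal feature representations at each spatial position."""
    return np.sum((fr1 - fr2) ** 2, axis=-1)

def change_intensity_map(re_image, re_dk, lam1=0.7, lam2=0.3):
    """Formula (9)-style weighted sum of the two regression errors."""
    return lam1 * re_image + lam2 * re_dk

def otsu_threshold(values, bins=256):
    """Minimal Otsu threshold: pick the bin center that maximizes
    the between-class variance of the value histogram."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                 # class-0 probability per candidate cut
    mu = np.cumsum(p * centers)       # class-0 cumulative mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        var_between = (mu_t * w0 - mu) ** 2 / (w0 * (1 - w0))
    return centers[np.nanargmax(var_between)]
```

A change map would then be `change_intensity_map(re_i, re_dk) > otsu_threshold(cim.ravel())` applied pixelwise.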

A. Datasets Description and Implementation Details
The Onera Satellite Change Detection (OSCD) dataset [75] is used to evaluate the performance of the DK-SSCD approach. The dataset was created for detecting changes in bitemporal Sentinel-2 images acquired between 2015 and 2018. The pixelwise ground-truth labels are manually annotated to obtain the change map between the bitemporal images, but the 10-m spatial resolution limit introduces some annotation errors. In the original supervised setting, 14 pairs are selected as the training set and the remaining 10 pairs are used as the test set. Our approach does not need the training set. Therefore, the CD accuracy is evaluated on the 10 test image pairs as in [1], [11], and [75], which provide accuracy metrics jointly considering all the pixels of the test set.
The DK-SSCD approach is implemented in PyTorch. The model is trained by the AdamW optimizer, and the initial learning rate is set to 0.0005 with linear decay. We tested all hyperparameters of the DK-SSCD approach, and the patch size and unfold stride have the greatest impact on the results. The patch size is set to 8 × 8, and the unfold stride during training and inference is set to 4 × 4 and 1 × 1, respectively, which achieves the best performance.
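The cropping of images into patches with a given unfold stride can be sketched in NumPy; the paper uses PyTorch's unfold, so this standalone version and its function name are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(image, patch=8, stride=4):
    """Crop an (H, W, C) image into patch x patch tiles taken every
    `stride` pixels, mimicking an unfold operation.
    Returns an array of shape (N, patch, patch, C)."""
    windows = sliding_window_view(image, (patch, patch), axis=(0, 1))
    windows = windows[::stride, ::stride]      # keep one window per stride step
    n_h, n_w, c = windows.shape[0], windows.shape[1], windows.shape[2]
    return windows.reshape(n_h * n_w, c, patch, patch).transpose(0, 2, 3, 1)
```

With a 64 × 64 image, a patch size of 8 and a stride of 4 give 15 × 15 = 225 patches, while a stride of 1 (as used during inference) gives 57 × 57 patches.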

B. Quantitative Evaluation
In this article, the quantitative evaluation metrics include precision (Pre), recall (Rec), overall accuracy (OA), Cohen's kappa score (Kap), and the F1-score. A higher Pre means fewer false alarms, and a higher Rec denotes a lower rate of missed detections. However, the Pre, Rec, and OA metrics overestimate the result when the changed pixels are only a small fraction of the image, which is misleading. Kap and F1 are comprehensive quality metrics combining Pre and Rec, which can overcome the negative correlation between Pre and Rec. Larger Kap and F1 values denote better performance. The five metrics are defined as

Pre = TP / (TP + FP)
Rec = TP / (TP + FN)
OA = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 · Pre · Rec / (Pre + Rec)
Kap = (OA − PE) / (1 − PE)

with PE = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] / (TP + TN + FP + FN)², where TP (true positive) denotes the number of pixels correctly classified as changed and TN (true negative) the number of pixels correctly classified as unchanged. The number of pixels misclassified as changed is represented by FP (false positive), and the number of pixels misclassified as unchanged by FN (false negative). Table II shows the quantitative results of the unsupervised and supervised methods; the unsupervised methods are divided into deep-learning, autoencoder, and conventional methods. The quantitative results of the ACGAN, CAA, GLRT, Log-ratio, and supervised methods in the table are taken from [1]. Among all unsupervised methods, the DK-SSCD approach achieves the best performance with a Kap of 53.34% and an F1 of 55.69%. The RLJA method obtains the second-best performance, followed by the Chen Prop. SSL method. In [1], the authors augment the image pairs during training by downloading additional Sentinel-2 images at the same locations as the original bitemporal images for each month between 2016 and 2020. Although this data augmentation achieves an improvement of 2.87% in F1 in [1], the CD accuracy is still worse than that of our approach. In this article, we only use the bitemporal images from the OSCD data for self-supervised learning.
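The five metrics can be computed directly from the confusion-matrix counts; this helper is a straightforward sketch of the standard definitions above, with a hypothetical function name.

```python
def cd_metrics(tp, tn, fp, fn):
    """The five change-detection evaluation metrics from the
    confusion-matrix counts TP, TN, FP, FN."""
    n = tp + tn + fp + fn
    pre = tp / (tp + fp)                       # precision: fewer false alarms
    rec = tp / (tp + fn)                       # recall: fewer missed detections
    oa = (tp + tn) / n                         # overall accuracy
    f1 = 2 * pre * rec / (pre + rec)           # harmonic mean of Pre and Rec
    # Expected agreement by chance, used by Cohen's kappa.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kap = (oa - pe) / (1 - pe)
    return {"Pre": pre, "Rec": rec, "OA": oa, "F1": f1, "Kap": kap}
```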
The performance of DSFA is the worst among the unsupervised deep-learning methods. The method employs CVA to predetect unchanged pixels as training samples. These samples contain false detections, so the errors accumulate and degrade the performance. In addition, when the RSIs feature vectors are not adopted during training and inference, the DK-SSCD approach achieves a Kap of 47.51% and an F1 of 50.56%. The performance of the DK-SSCD approach without domain knowledge is still better than that of the other methods.

C. Result of Comparison Method
Compared to the supervised methods, the DK-SSCD approach performs better than the FC-EF network. Daudt et al. [9] proposed an encoder-decoder network with skip connections by modifying the FC-EF network to apply residual blocks (FC-EF-Res). The FC-EF-Res network outperforms the DK-SSCD approach by only 2.22% in Kap and 3.51% in F1 (Table II). The DK-SSCD approach thus largely closes the performance gap between the supervised and unsupervised methods. The FC-EF-Res network achieves the best performance, but it is more computationally expensive and requires manually labeled data for training. Also, the supervised methods require a higher registration precision between the bitemporal images, whereas the patch-based DK-SSCD approach obtains a good change map at the registration precision of the Sentinel system.
Besides the quantitative analysis, we also provide a visual comparison (Fig. 4). The CVA results show large green areas where changed samples are misclassified as unchanged. Since the pixelwise MAD, ISFA, and KPCA-MNet methods can hardly extract spatial context information, their change maps contain many unchanged samples misclassified as changed. In the first scenario, the Chen Prop. SSL method fails to detect most of the changed pixels, and the change maps of the RLJA method have more misclassifications than those of the DK-SSCD approach. The RLJA and Chen Prop. SSL methods are convolutional neural networks, and CNN-based models have a poor ability to extract global information. The DK-SSCD approach uses the vision transformer small as its backbone network. The self-attention and multihead attention mechanisms in the vision transformer can mine long-range dependencies to obtain the relationships between pixels across the entire image, improving the model performance.
To sum up, the DK-SSCD approach and the supervised FC-EF-Res method provide better change maps than the other methods. The FC-EF-Res method is more accurate and less noisy because it uses manually labeled data to train the model, but it fails to detect most of the unchanged pixels in the fourth scenario. A fully supervised method heavily relies on the quality of the training samples and on the distribution of changed ground features in the manually labeled data. Although some small objects are falsely detected by the DK-SSCD approach, this may be because some small changes are ignored in the ground truth. It is hard for manual annotation to recognize these small changes in the Sentinel-2 images because the 10-m spatial resolution is too low.

D. Ablation Studies
In this section, we evaluate the impact of different modules and methods on the final model accuracy through ablation studies, including mean teacher (Fig. 5), feature selection (Table III), core modules (Tables IV and V), and loss weighting (Table VI).

1) Superiority of Inference Using Mean Teacher:
We compare the performance of using the mean teacher and the student network during inference, as shown in Fig. 5. The green and red dotted lines denote that the mean teacher and the student network, respectively, are used as the model to implement CD during inference. The mean teacher network uses the EMA weights to produce a more accurate model instead of sharing weights with the student network. Usually, the model weights fluctuate around the optimum in the last several epochs. The weight averaging improves the outputs of all layers and aggregates the information of each epoch, which helps the model represent intermediate features more accurately. Therefore, the mean teacher network performs better than the student network, achieving a performance improvement of around 0.95% in F1. In addition, the mean teacher network effectively reduces drastic fluctuations in accuracy. The EMA weights weaken the impact of noise, improving the model robustness [30], [65].
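The EMA update behind the mean teacher can be sketched as follows; the momentum value is a typical placeholder, not necessarily the one used in the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Mean-teacher update: each teacher weight tracks an exponential
    moving average of the corresponding student weight, so the teacher
    smooths out the weight fluctuations of the last training epochs."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

# The teacher is never updated by gradients; it only follows the student.
teacher, student = [0.0], [1.0]
for _ in range(500):
    teacher = ema_update(teacher, student, momentum=0.99)
```

After many steps the teacher converges toward the student while damping per-epoch noise, which is why inference with the teacher is slightly more accurate and more stable.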
2) Effectiveness of Feature Selection: Table III shows that our proposed feature selection method can effectively improve the CD accuracy. The low-similarity and high-similarity settings denote that we select the feature vectors with a cosine similarity of less than and greater than 0.5, respectively, to calculate the contrastive loss. By selecting the high-similarity feature vectors, the Kap and F1 metrics are improved by 3.58% and 3.25%, respectively. All five metrics are improved, among which Pre achieves the most significant improvement of 6.4%. In contrast, the Kap and F1 metrics are improved by only 0.63% and 0.58% when selecting the low-similarity feature vectors, so the low-similarity feature vectors perform worse than the high-similarity ones. After feature selection, the high-similarity feature vectors are used to implement the hard negative sampling strategy, which helps to obtain useful negative samples and provides significant gradient information for representation learning [43].
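The selection of high-similarity feature vectors can be sketched as follows; the 0.5 threshold matches the description above, while the function name and array layout are assumptions.

```python
import numpy as np

def select_high_similarity(feat_t, feat_s, threshold=0.5):
    """Keep only feature-vector pairs (rows) whose cosine similarity
    between the mean teacher output and the student output exceeds the
    threshold; these serve as hard negatives for the contrastive loss."""
    a = feat_t / np.linalg.norm(feat_t, axis=1, keepdims=True)
    b = feat_s / np.linalg.norm(feat_s, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)        # rowwise cosine similarity
    mask = cos > threshold
    return feat_t[mask], feat_s[mask], cos
```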
3) Effectiveness of Core Modules: In Table IV, the experiments are divided into those without and with the EPM. Previous articles have shown that a smaller patch may make it difficult for the backbone to extract valuable features [53], [54]. Without the EPM, the 8 × 8 patch is directly input into the backbone network, and the Kap and F1 are only 51.39% and 53.96%, respectively. However, when we input a larger patch, the change map becomes too coarse, which reduces the CD accuracy. To resolve this contradiction, we design the EPM to expand the patch size input into the backbone network. In Table IV, we test different module depths to verify the effectiveness of the EPM, including one-layer, two-layer, and three-layer modules. The two-layer module achieves the best performance with a Kap of 53.34% and an F1 of 55.69%, while the three-layer module performs worst: it is too deep, causing overfitting, losing some image information, and decreasing the model accuracy. The EPM expands the patch size from 8 × 8 to 38 × 38, which is beneficial for improving the model performance. Without the EPM, inputting a 38 × 38 patch into the backbone network yields a Kap and an F1 of only 31.59% and 34.08%, respectively. Our designed EPM thus not only expands the patch size but also improves the CD accuracy. The impact of the critical input-data parameters is discussed in the discussion section.
The upsampling methods include interpolation and deconvolution. In Table IV, we test different upsampling methods to expand the patch size, replacing the dilated deconvolution in the EPM with bilinear interpolation, nearest-neighbor interpolation, and plain deconvolution. The bilinear method computes each pixel value by bilinear interpolation with the position as the weight and obtains the worst performance. The nearest method directly assigns the nearest pixel value in the original image to the enlarged new position. The bilinear and nearest methods cannot extract image features, whereas deconvolution not only retains the image information but also has feature extraction capabilities, so deconvolution performs better than the interpolation methods. Dilated deconvolution has a larger receptive field than plain deconvolution, which helps obtain richer feature information to improve accuracy. However, when dilated deconvolution is used in too many layers, the continuity of information is lost, reducing the model performance; our experiments therefore show that the two-layer module performs best. Hendrycks et al. [76] proposed the Gaussian error linear unit (GELU), a high-performing neural network activation function with better nonlinear expression ability and numerical stability than other activation functions, which might improve our model performance. Therefore, we tested the influence of the GELU in the EPM, but the experiment shows that the Kap and F1 are reduced by 10.67% and 10.96%, respectively. Usually, an activation function is used in the model projection layer or in deeper network layers to regularize unimportant activation information [77]. The EPM is a shallow layer, where the image features cannot yet clearly distinguish the key information. Therefore, using an activation function in the EPM loses important information, which makes the performance decline.
In summary, our designed two-layer EPM consisting of dilated deconvolution and BN achieves the best performance.
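How the EPM can grow an 8 × 8 patch to 38 × 38 can be checked with the transposed-convolution size formula; the kernel/stride/dilation values below are one configuration that reproduces the 8 → 38 expansion in two layers and are an assumption, not the paper's exact settings.

```python
def deconv_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output length of a transposed convolution along one axis
    (PyTorch ConvTranspose2d convention, output_padding = 0)."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + 1

# Two dilated-deconvolution layers expanding 8 -> 18 -> 38.
layer1 = deconv_out(8, kernel=2, stride=2, dilation=3)
layer2 = deconv_out(layer1, kernel=2, stride=2, dilation=3)
```

In the design summarized above, each dilated deconvolution would be followed by batch normalization.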
In Table V, we design one-layer, two-layer, three-layer, and four-layer MLPs, and the one-layer MLP has the worst performance. With the three-layer MLP, Kap and F1 are significantly improved by 6.79% and 5.87%, respectively, with the GELU. The GELU in the projection head combines the changed and unchanged probability values, which is more helpful for CD. The last layer of the projection head is the ℓ2 normalization bottleneck. In deep-learning networks, ℓ2 normalization helps the network converge quickly and stabilizes training. Kap and F1 are improved by 14.66% and 12.86%, respectively, with the ℓ2 normalization bottleneck. In summary, our designed projection head is composed of a three-layer MLP with the GELU activation function and an ℓ2 normalization bottleneck, which achieves the best performance.
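A NumPy sketch of such a projection head (three linear layers with GELU between them and an ℓ2-normalized output); the layer widths, the lack of biases, and the tanh GELU approximation are simplifications, not the paper's exact configuration.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the Gaussian error linear unit [76]."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def projection_head(x, w1, w2, w3):
    """Three-layer MLP with GELU activations, followed by an
    l2-normalization bottleneck on the output features."""
    h = gelu(x @ w1)
    h = gelu(h @ w2)
    h = h @ w3                                            # linear bottleneck
    return h / np.linalg.norm(h, axis=1, keepdims=True)   # l2 normalize rows
```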

4) Effectiveness of Loss Weighting:
In Table VI, we perform ablation experiments on loss weighting. The model performance is the best with λ 2 = 0.3 and is the worst with λ 2 = 0.7. The domain knowledge from the RSIs mainly highlights some ground feature information. The NDVI only benefits the differentiation between forest, grassland, and cropland. The NDWI only has a high response to water due to the scattering and absorption properties of water. The NDBSI emphasizes the feature of the bare surface. Therefore, domain knowledge accounts for a small proportion of loss weighting, which can correctly guide the model to improve performance during training and inference. If the proportion is too large, the model will be biased to domain knowledge features, resulting in performance degradation.

V. DISCUSSION
In this section, we discuss the performance of different domain knowledge fusion methods and the sensitivity of the parameters.

A. Domain Knowledge Fusion Method
In Table VII, we consider other domain knowledge fusion methods to analyze the effect of domain knowledge. These methods are described as follows.
1) Without RSIs Feature Vectors: We do not use the RSIs feature vectors during training or inference.
2) Concatenation: The concatenation operator directly concatenates the blue, green, red, and NIR bands with the four RSIs feature vectors into eight-channel data. The eight-channel data are then fed into the network architecture for representation learning and into the mean teacher network to detect changes during inference.
3) Training: Compared with our proposed fusion method, the difference is that the inference process does not use the four RSIs feature vectors to generate the CIM.
4) Inference: Compared with our proposed fusion method, the difference is that the training process does not use the four RSIs feature vectors for representation learning.
5) Our Proposed: Our fusion method in the DK-SSCD approach uses the four RSIs feature vectors during both training and inference to fuse the domain knowledge.
Table VII shows that the concatenation fusion method obtains the worst performance, and all its evaluation metrics are lower than those of the method without RSIs feature vectors. The concatenation fusion method feeds the eight-channel data into the model during training, which may bias the representation learning toward domain knowledge features. Thus, it is difficult for the model to identify the changes in various ground objects. The training and inference fusion methods both perform worse than our proposed fusion method. The training fusion method does not use domain knowledge during inference, so its performance is similar to that of the method without RSIs feature vectors, and both perform worse than our proposed fusion method and the inference fusion method.
Although the RSIs feature vector data are not used in the training process, the inference fusion method is second only to our proposed fusion method, with a Kap of 49.66% and an F1 of 52.35%. The inference fusion method does not use domain knowledge to guide the model during training, resulting in a limited improvement in the CD accuracy. Our proposed fusion method in the DK-SSCD approach uses the RSIs data during both training and inference, and the domain knowledge effectively guides the model to achieve the best performance, with a Kap of 53.34% and an F1 of 55.69%.

B. Sensitivity Analysis of Parameters
We tested all hyperparameters in the DK-SSCD approach; the three parameters PS, US, and USI have the greatest influence on the CD results. We quantitatively evaluate the performance of the DK-SSCD approach for different patch sizes and unfold strides using the F1 metric, as shown in Fig. 6. The optimal parameter settings are a PS of 8, a US of 4, and a USI of 1, achieving an F1 of 55.69%. The USI denotes the unfold stride during inference and the cell size used to generate the CIM, which determines the fineness of the change map. The experimental results show that the USI parameter has the greatest influence: the smaller its value, the finer the CD result.

VI. CONCLUSION
In this article, we propose a novel DK-SSCD approach using bitemporal images. Our proposed network architecture provides a good feature representation space that can highlight the changed information of bitemporal images using contrastive learning and domain knowledge. The mean teacher network improves the CD accuracy and effectively reduces drastic fluctuations in accuracy during inference. Our experiments demonstrate that the Kap and F1 metrics are improved by 5.83% and 5.13% by fusing domain knowledge to guide model training and inference. In addition, our proposed feature selection method effectively improves the CD accuracy, with an improvement of 3.58% in Kap and 3.25% in F1. Compared with the other state-of-the-art CD methods, the DK-SSCD approach achieves better performance.
Our future work will extend the model to large numbers of time-series images over the same area, mining spatial-temporal information to refine the change maps. In addition, we will further study an unsupervised clustering algorithm to investigate semantic change and reveal the semantic categories before and after the change.