Pseudo Features-Guided Self-Training for Domain Adaptive Semantic Segmentation of Satellite Images

Semantic segmentation is a fundamental task of great importance to real-world satellite image-based applications. Yet a widely acknowledged issue is that a semantic segmentation model applied to unseen scenery performs much worse than it does on scenery similar to its training data. This phenomenon is usually termed the domain shift problem. To tackle it, this article presents a self-training-based unsupervised domain adaptation (UDA) method. Different from previous self-training approaches, which focus on rectifying and improving the quality of the pseudo labels, we instead seek to exploit feature-level relations among neighboring pixels to structure and regularize the prediction of the adapted model. Based on the assumption that the spatial topological relation is maintained despite the impact of the domain shift, we propose a novel self-training mechanism that performs domain adaptation (DA) by exploiting local relations in the feature space spanned by the teacher model, from which the pseudo labels are generated. Quantitative experiments on four different public benchmarks demonstrate that the proposed method can outperform the other UDA methods. Besides, analytical experiments also intuitively verify the proposed assumption. Code will be publicly available at https://github.com/zhu-xlab/PFST.

At the pixel level, semantic segmentation [1], [2], [3], [4] has been serving as an important technique in satellite image processing-based applications such as urban planning [5], land use and land cover mapping [6], and automatic agriculture [7]. With the renaissance of deep learning, the performance of data-driven semantic segmentation algorithms has been pushed to a new level. However, such a performance boost largely relies on the emergence of large-scale manually annotated data.
This leads to a problem: when a semantic segmentation model is applied to an unseen scenario without sufficient labels, its performance may drop drastically compared to its performance on the source domain. In the field of remote sensing, since satellite image data are highly diversified, biases and shifts widely exist between the source and the target domains (where we train and evaluate our model, respectively). Such shifts may result from differences in the sensors used, different atmospheric conditions, seasonal changes, distributional biases of the ground objects, and so on. To tackle this issue, domain adaptation (DA) techniques [8], [9], [10] have been attracting more and more attention.
DA leverages the source and the target domain data at the same time to bridge the shifts between them. Unsupervised DA (UDA) is a common and practical DA setting where the target data are available without any target labels. One popular technique for UDA is self-training, which has consistently achieved state-of-the-art results [11], [12], [13]. The fundamental idea behind self-training is to generate pseudo labels for the target domain data using a source-trained model, and then fine-tune the UDA model using selected high-confidence pseudo labels. Many of these methods focus on evaluating the quality of pseudo labels and developing selection strategies to filter out noisy pseudo labels [14], [15].
However, previous self-training methods often overlook the potential benefits of utilizing feature-level knowledge from the teacher model, which is used to generate pseudo labels. Pseudo labels are susceptible to noise [illustrated in Fig. 1 (left)], possibly due to domain shifts that bias the distribution of target objects in the low-dimensional output space. This realization raises the question: can the higher-dimensional features generated by the teacher model (referred to as pseudo features) be more robust to domain shifts compared to the pseudo labels that lie in the low-dimensional output space?

Fig. 1. Illustration of our motivation. Traditional pseudo-labeling or entropy minimization methods heavily rely on the correctness of initial predictions. We propose to model the relation between neighboring samples to counteract the negative effect of wrong initial predictions. Detailed discussion will be presented in Section III-A.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Intuitively, the answer seems to be positive. Based on such an assumption, we propose a pseudo features-guided self-training (PFST) method that leverages feature-level relations in addition to pseudo labels. We assume that while domain shifts can affect the consistency and accuracy of target predictions in the output space, the spatial topological relation is relatively preserved in the high-dimensional feature space. Building on this insight, we propose to regularize target outputs using feature-level local relations between the pseudo features.
Specifically, we measure the similarity between neighboring feature pairs generated by the teacher model. For the most similar pairs within each local region, we strengthen the correlation of their corresponding target probability outputs. Conversely, for the most dissimilar pairs, we reduce their output-level correlation. By exploiting feature-level relations as a more robust and domain shift-insensitive source of information, we aim to improve the traditional self-training process. Our contributions can be summarized as follows.
1) We propose the assumption that the high-dimensional local similarity structure in the pseudo feature space is less sensitive to domain shift than output space pseudo labels when applying the source-trained model on target data. Besides, we experimentally verify its correctness. Such an assumption is insightful to the development of further self-training-based UDA methods.
2) We develop a novel self-training approach named PFST. By exploiting feature-level local relation, PFST establishes the connection between the teacher model feature space and the student model output space, which further helps to counteract the negative effect caused by the noisy pseudo labels and improve the generalizability of the UDA model.
3) We establish and release a standard code library for UDA in remote sensing based on MMSegmentation [16] and EarthNets [17].

II. RELATED WORKS
In this section, we review some of the key techniques, strategies, and approaches that have promoted the development of UDA, both in the computer vision community and in the remote sensing community.

A. Adversarial Learning
Adversarial learning applies generative adversarial networks (GANs) [18] to perform adaptation between two or more domains. The philosophy is to train a discriminator together with the generator. While the discriminator is trained to distinguish whether the generator's outputs come from the source or the target domain, the generator is trained to confuse the discriminator so that it fails to do so. In such a manner, the outputs from the generator conditioned on different domains become indistinguishable and consistent, and the domain shifts are reduced. According to the scale or the level of the networks where adversarial learning is applied, this line of work can be further categorized into image-level, feature-level, or output-level approaches.
1) Image-Level Adversarial Learning: In the earlier stage of the research toward adversarial learning, it is more widely used in the field of image generation [19] or image style transfer [20] instead of DA. Typical works like CycleGAN [21] and StarGAN [22], [23] apply GAN to transfer the image style, e.g., transfer oil painting to photos, images of zebra to horse, human faces of different gender and hair styles. Conditional GAN [24] extends these applications to perform transfer between images and their semantic annotations.
In the field of remote sensing, image-level adaptation or image-to-image translation is widely utilized to standardize the style of images from different domains. Compared to natural scene images, remote sensing data are usually multimodal [e.g., RGB, multispectral, hyperspectral, and synthetic aperture radar (SAR) data], multisensory, multitemporal, or geolocationally diversified. As a result, the difference in image appearances has a large impact on the generalizability of the downstream model. Previous works have demonstrated the usefulness of applying image-to-image translation techniques to remote sensing data. For example, Bidirectional Domain Adaptation Network (BiFDANet) [25] applies a CycleGAN architecture to perform image-to-image translation between the source and the target images, after which a semantic consistency loss is applied to the outputs of the original and the stylized images. StandardGAN [26] and DAug [27] utilize StarGAN-like architectures and adaptive instance normalization (AdaIN) [28] to transfer the style of images captured from multiple cities, which enables multisource multitarget DA.
2) Feature- and Output-Level Adversarial Learning: Apart from the image-level domain shift, there are always biases between the source and the target domains that cannot be transferred solely by image-level style transfer, for example, differences in the spatial geometry or the unbalanced distribution of source and target semantic objects. To mitigate those latent high-level domainwise biases, aligning the source and the target domain at the feature or output level becomes necessary. AdaptSeg [29] highlights the importance of adopting adversarial learning in the output space and uses a GAN to align the source and the target output. AdvEnt [30] discovers the effectiveness of entropy minimization in the target domain, and further proposes to align the source and the target entropy maps in an adversarial manner.
In the field of remote sensing, full space domain adaptation network (FSDAN) [31] uses a CycleGAN structure to generate target-style source images to mitigate the domain shift problem. After that, it also applies feature-level and output-level adversarial learning to further improve the adaptation performance, which leads to a full-space alignment between the source and the target domain. Entropy-guided adversarial domain adaptation (EGA) [32] proposes an entropy-guided adversarial learning algorithm. While adversarial learning is conducted on the output level, a self-adaptive weight is calculated to reweight the prediction from the discriminator. Triplet adversarial domain adaptation (TriADA) [33] designs an output-level adversarial learning method based on the triplet loss, where an image triplet from the source and the target domain is input to the semantic segmentation network during training. Unlike the previous methods, the discriminator is devised as a similarity metric to measure the domain-level similarity between two input images.

B. Self-Training
Adversarial learning-based methods are often characterized by their instability and difficulties in optimization. In contrast, self-training [34], [35], [36], [37], which involves fine-tuning the UDA model using pseudo labels generated from target data, offers a more efficient way to leverage target information and is typically easier to optimize. In the field of computer vision, an important focus of self-training-based methods is to effectively filter out noisy pseudo labels. For example, CBST [14] points out that training with pseudo labels runs the risk of being overwhelmed by easy-to-transfer classes, and proposes to balance the distribution of pseudo labels by applying a classwise confidence threshold. To prevent the self-trained network from being overconfident when learning from hard pseudo labels, confidence regularized self-training (CRST) [15] proposes to regularize the self-training process by using soft labels. In uncertainty reduction for model adaptation (URMA) [38], the prediction uncertainty is estimated via the variance of different network outputs, which is later used to automatically weigh the pseudo labels. Prototypical domain adaptation (ProDA) [39] maintains a set of prototypes for each class during training, and the relative distance between features and prototypes is used to rectify the false pseudo labels.
In remote sensing, self-training seems to have received less attention. Wang et al. [40] establish a benchmark for evaluating different domain adaptive semantic segmentation methods, where they study the effect of some classic DA methods proposed in the computer vision community. Zhang et al. [41] integrate the adversarial learning mechanism into the self-training pipeline to perform UDA in the task of road extraction. In remote sensing scene classification, there are also works like [42], which studies the influence of different strong augmentations applied to the student model branch in the self-training pipeline.

C. Data Augmentation and Other Techniques
Since self-supervised learning has achieved remarkable progress [43], the importance of data augmentation has been widely acknowledged. Among all different data augmentation methods, data mixing [44] has been demonstrated to be effective both in classification [44] and semantic segmentation [45]. Domain adaptation via cross domain mixed sampling (DACS) [45] studies the impact of data mixing in the field of UDA, where the ClassMix [46] strategy is used to cut out half of the classes in a single source image and overlay the target image on top of the cut area.
In remote sensing, works like randomized histogram matching (RHM) [47] point out that a simple colorwise data augmentation strategy like RHM can produce semantic segmentation results comparable to those of complex image-to-image translation-based methods.
Other DA works focus on devising a better sampling strategy to reduce the influence of the domain shift. DAFormer [12] proposes a rare class sampling strategy to balance the distribution of different semantic classes. The oversampling of the rare classes tends to mitigate the long tail issue [48] and improve the generalizability of the semantic segmentation model. Curriculum-style local-to-global cross-domain adaptation (CCDA) [49] proposes a curriculum-style UDA method that ranks the target patches from easy to hard according to the output entropy. Those patches are then fed to the network from easy to hard. This curriculum-based sampling strategy is reported to be effective.

Fig. 2. Illustration of our proposed PFST. A teacher model and a student model are maintained. The teacher model generates pseudo labels based on the target image to supervise the student model. The student model is trained on both the source labels and the pseudo labels. Its weights are used to update the teacher model by exponential moving average [50]. The source and the target images are augmented via weak or strong data augmentation before being input to the models (please refer to Section IV-B for more details). We calculate sliding windows over the teacher model features and the student model outputs simultaneously. For each corresponding window pair, we apply a local similarity loss on output probabilities according to their local feature similarity. In such a way, we incorporate a new regularization mechanism that connects the teacher model feature space and the student model output space.

III. METHODS
The overall architecture of our approach is illustrated in Fig. 2. In Section III-A, we will first introduce and illustrate the DA problem and describe the motivation of the proposed method. Then in Section III-B, we formulate the UDA setting for semantic segmentation. Later on, we present the optimization objectives and other loss functions in Sections III-C to III-E.

In order to address the issue of optimization direction when such errors occur, we propose the utilization of local similarity and local discrepancy loss. Specifically, in case 1 where a sample belonging to class A is misclassified as class B, if we have correctly classified class A samples in its local neighborhood with high similarity, maximizing their output-level correlation can help redirect the optimization toward the expected direction. Similarly, in case 2 where a sample from class B is misclassified as class A, if it exhibits larger discrepancy with nearby class A samples in the source feature space, minimizing the output correlation between them can also aid the optimization process.

B. Problem Formulation
Let $\{x_i^s\}_{i=0}^{N_s}$ and $\{x_i^t\}_{i=0}^{N_t}$ be the source and the target domain data, and $\{y_i^s\}_{i=0}^{N_s}$ and $\{y_i^t\}_{i=0}^{N_t}$ be the corresponding labels. Here $x_i^s, x_i^t \in \mathbb{R}^{H \times W \times 3}$ denote the source and the target images, while $y_i^s, y_i^t \in \mathbb{R}^{H \times W}$ indicate their labels. $H$ and $W$ specify the height and width of the images. Note that the target domain labels $\{y_i^t\}_{i=0}^{N_t}$ are only available during the evaluation time. $N_s$ and $N_t$ indicate the sizes of the source and the target datasets. With these notations defined, the UDA problem for semantic segmentation can be formulated as minimizing a loss function $\mathcal{L}(\cdot)$ defined on the labeled source data and the unlabeled target data, where $\theta_S$ and $\theta_T$ are the parameters of the student and the teacher models, respectively. In self-training-based UDA methods, the teacher model $T$ is usually used to generate pseudo labels to supervise the student model $S$. $T$ can be either pretrained on the source domain data in an offline manner [34] or updated according to the student model weights $\theta_S$ via exponential moving average [12]. In our method, we adopt the latter strategy. More specifically, the weights of the teacher network $\theta_T$ are updated by

$$\theta_T^{t+1} = \alpha\,\theta_T^{t} + (1-\alpha)\,\theta_S^{t}$$

where $t$ denotes the current iteration step. The decay weight $\alpha$ is set to 0.999 following [12].
Besides, we denote $h$ as the feature extractor, $g$ as the classifier of the network model, and $f = h \circ g$ as their composition. The target feature and output probability are denoted by $h(x^t) \in \mathbb{R}^{h \times w \times d}$ and $f(x^t) \in \mathbb{R}^{h \times w \times c}$, where $h \times w$ corresponds to the spatial dimension of the extracted feature map, $d$ indicates the feature dimension, and $c$ is the number of classes.
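The exponential moving average update of the teacher described above can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the function name `ema_update` and the representation of model weights as a dictionary of flat float lists are assumptions made for readability, and a real model would hold framework tensors instead.

```python
def ema_update(teacher, student, alpha=0.999):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student.

    `teacher` and `student` are hypothetical dicts that map parameter names
    to flat lists of floats (an assumption made for this sketch).
    """
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


# Toy usage: one update step with the decay weight 0.999 used in the text.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
teacher = ema_update(teacher, student)
```

With a decay this close to 1, the teacher moves only slightly toward the student at each step, which is what makes its pseudo labels change slowly during training.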

C. Objective Function
We define our optimization objective in (3) as a weighted combination of the loss terms introduced in Sections III-C to III-E. Among them, $L_{src}$ is the source domain semantic segmentation loss, defined as the cross-entropy between the student output on the source image and $\mathbb{1}_{y_i^s}$, the one-hot encoding of the source label. $L_{pse}$ is the pseudo label loss widely used for self-training approaches in the field of DA. We adopt the weighted pseudo label loss used in previous works [12], [45], i.e., the cross-entropy between the student output on the target image and $\mathbb{1}_{\tilde{y}^t}$, weighted by $q(x^t)$. Here $\mathbb{1}_{\tilde{y}^t}$ is the one-hot encoding of the teacher prediction $\tilde{y}^t$, where $\tilde{y}^t = f_{\theta_T}(x^t)$ corresponds to the pseudo label generated by the teacher model on the target image. $q(x^t)$ is a weighting factor that balances the loss based on the predicted confidence on each target image. It counts the number of pixels where the classwise maximum output probability is larger than a certain threshold $\tau$. $\tau$ is fixed to 0.98 empirically in all our experiments.
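The confidence-weighted pseudo label loss can be illustrated as follows. This is a hedged sketch under two assumptions that the text does not state explicitly: the pixel count behind $q(x^t)$ is normalized by the total number of pixels (as is done in [12]), and the cross-entropy is averaged over pixels. The function names and the list-of-lists layout (one probability vector per pixel) are hypothetical.

```python
import math


def confidence_weight(teacher_probs, tau=0.98):
    """Fraction of pixels whose maximum class probability exceeds tau.

    Normalizing the count by the number of pixels is an assumption.
    """
    confident = sum(1 for p in teacher_probs if max(p) > tau)
    return confident / len(teacher_probs)


def pseudo_label_loss(student_probs, teacher_probs, tau=0.98, eps=1e-12):
    """Cross-entropy against hard teacher pseudo labels, scaled by q."""
    q = confidence_weight(teacher_probs, tau)
    total = 0.0
    for s, t in zip(student_probs, teacher_probs):
        label = max(range(len(t)), key=t.__getitem__)  # argmax of teacher
        total += -math.log(s[label] + eps)
    return q * total / len(student_probs)


# Two pixels, two classes: only the first teacher prediction is confident.
teacher_probs = [[0.99, 0.01], [0.60, 0.40]]
student_probs = [[0.50, 0.50], [0.50, 0.50]]
q = confidence_weight(teacher_probs)
loss = pseudo_label_loss(student_probs, teacher_probs)
```

In this toy case half of the pixels pass the threshold, so the whole image contributes with half of its cross-entropy.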

D. Local Similarity Loss
Local similarity loss $L_{loc}$ is used to supervise the student model by exploiting the feature-level similarity implied in the teacher model. Specifically, it strengthens the correlation between two neighboring target outputs that share a strong similarity in the feature space defined by the teacher model, with a positive loss term $L_{pos}$. Meanwhile, it increases the discrepancy between target outputs that have weak similarity in the feature space using a negative term $L_{neg}$. The positive term $L_{pos}$ is defined in (8). Here $A_\theta$ is a feature space similarity measurement for a pair of features extracted by a deep model $\theta$. For simplicity, we adopt the cosine similarity

$$A_\theta(x_i^t, x_j^t) = \frac{h_\theta(x_i^t)^\top h_\theta(x_j^t)}{\lVert h_\theta(x_i^t)\rVert \, \lVert h_\theta(x_j^t)\rVert}$$

where $h_\theta(x_i^t) \in \mathbb{R}^d$ is the feature at position $i$. $\Omega_i$ defines a sliding window centered at position $i$ ($i$ itself excluded). Then $\Omega_i^+$ contains the top $\xi$ locations within $\Omega_i$ that yield the highest $A_\theta(x_i^t, x_j^t)$ value. $\xi$ is set to 3 in all of our experiments. $I^+_{i,j}$ is a measurement in the target output space evaluating the probability that two nearby located pixels produce the same prediction

$$I^+_{i,j} = \operatorname{tr}\left(p_i p_j^\top\right) = \sum_{k=1}^{c} p_i^{(k)} p_j^{(k)}$$

where $p = f_{\theta_S}(x^t)$ is the student model's output on the target image. $p_i p_j^\top \in \mathbb{R}^{c \times c}$ measures the joint probability distribution of $p_i$ and $p_j$ regardless of their dependence.
The negative loss term $L_{neg}$ is defined analogously to $L_{pos}$. In contrast to $\Omega_i^+$ in (8), here $\Omega_i^-$ defines the top $\xi$ locations that have the lowest $A_{\theta_T}$ value within a sliding window. Different from $I^+_{i,j}$, $I^-_{i,j}$ measures the probability of cases where $p_i$ and $p_j$ indicate different classes

$$I^-_{i,j} = \sum_{k \neq l} p_i^{(k)} p_j^{(l)}.$$

Note that $I^+_{i,j} + I^-_{i,j} = 1$ since $p$ is probabilistic. By imposing $L_{pos}$ only on $\Omega_i^+$ and $L_{neg}$ only on $\Omega_i^-$, a relative local relation is considered in addition to the absolute one incorporated by $A_{\theta_T}(x_i^t, x_j^t)$. The intuition behind this is that feature pairs of the same class are more likely to lie in $\Omega_i^+$, while feature pairs of different classes mostly lie in $\Omega_i^-$.
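To make the pair selection concrete, the following sketch ranks the neighbors of every position by the cosine similarity of the teacher features, keeps the ξ most and ξ least similar ones, and penalizes the student outputs accordingly. It is an illustration under stated assumptions, not the exact losses of the paper: the per-pair penalties are taken to be the negative log of the same-class and different-class probabilities, the similarity-dependent weighting is omitted, and fewer pairs are kept at image borders so that the two sets stay disjoint. The function names, the nested-list layout, and these choices are all hypothetical; the 3 × 3 window with dilation 2 follows the implementation details reported in Section IV-B.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def same_class_prob(p_i, p_j):
    """Trace of the outer product p_i p_j^T, i.e., the dot product."""
    return sum(a * b for a, b in zip(p_i, p_j))


def local_similarity_loss(feats, probs, xi=3, dilation=2, eps=1e-12):
    """feats[y][x]: teacher feature vector; probs[y][x]: student probabilities."""
    h, w = len(feats), len(feats[0])
    offsets = [(dy, dx) for dy in (-dilation, 0, dilation)
               for dx in (-dilation, 0, dilation) if (dy, dx) != (0, 0)]
    pos_terms, neg_terms = [], []
    for y in range(h):
        for x in range(w):
            nbrs = [(y + dy, x + dx) for dy, dx in offsets
                    if 0 <= y + dy < h and 0 <= x + dx < w]
            ranked = sorted(
                nbrs,
                key=lambda n: cosine(feats[y][x], feats[n[0]][n[1]]),
                reverse=True)
            k = min(xi, len(ranked) // 2)  # assumption: disjoint sets at borders
            for ny, nx in ranked[:k]:  # most similar neighbors
                i_pos = same_class_prob(probs[y][x], probs[ny][nx])
                pos_terms.append(-math.log(i_pos + eps))
            for ny, nx in ranked[len(ranked) - k:]:  # least similar neighbors
                i_neg = 1.0 - same_class_prob(probs[y][x], probs[ny][nx])
                neg_terms.append(-math.log(i_neg + eps))
    l_pos = sum(pos_terms) / max(len(pos_terms), 1)
    l_neg = sum(neg_terms) / max(len(neg_terms), 1)
    return l_pos, l_neg


# Toy 5 x 5 map with two feature clusters split between columns 1 and 2.
feats = [[[1.0, 0.0] if x < 2 else [0.0, 1.0] for x in range(5)] for _ in range(5)]
good = [[[0.9, 0.1] if x < 2 else [0.1, 0.9] for x in range(5)] for _ in range(5)]
flat = [[[0.5, 0.5] for _ in range(5)] for _ in range(5)]
good_pos, good_neg = local_similarity_loss(feats, good)
flat_pos, flat_neg = local_similarity_loss(feats, flat)
```

On this toy map, student outputs that follow the feature clusters receive a lower positive term than uninformative uniform outputs, which is the behavior the regularizer is meant to encourage.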

E. Source Feature Distribution Loss
When applying the local similarity loss $L_{loc}$ on the target domain, we hope the feature similarity between pixel pairs that share the same label is large, while the similarity between pixel pairs with different labels is small. To achieve this, we introduce a feature distribution loss on the source domain, aimed at increasing the separability of the two similarity distributions. In this context, consider a scenario where the similarity values between positive and negative feature pairs follow two unknown distributions denoted as $P_{\theta_S}$ and $N_{\theta_S}$, respectively. These distributions can be characterized by their means, denoted as $\mu_{pos}$ and $\mu_{neg}$, and their standard deviations, denoted as $\sigma_{pos}$ and $\sigma_{neg}$.
We hope the two distributions can be as distinguishable as possible. To this end, we apply a feature distribution loss $L_{feat}$ on the source domain using the labeled source data

$$L_{feat}(x^s) = -\hat{\mu}_{pos} + \hat{\mu}_{neg} + \hat{\sigma}_{pos} + \hat{\sigma}_{neg} \tag{14}$$

where $\hat{\mu}_{pos}$ and $\hat{\sigma}_{pos}$ are the mean and standard deviation of the similarity between all the positive pixel pairs (given $j \in \Omega_i$, $y_i^s = y_j^s$) within all the sliding windows. They are used to approximate $\mu_{pos}$ and $\sigma_{pos}$. Likewise, $\hat{\mu}_{neg}$ and $\hat{\sigma}_{neg}$ are the mean and standard deviation of the similarity between the negative pixel pairs (given $j \in \Omega_i$, $y_i^s \neq y_j^s$). By minimizing the value of $-\hat{\mu}_{pos} + \hat{\mu}_{neg}$, the difference between the means of the two distributions is maximized, leading to increased separation between them. Additionally, by minimizing $\hat{\sigma}_{pos}$ and $\hat{\sigma}_{neg}$, the standard deviations of the two distributions are reduced, further enhancing the distinction between them. As a result, the two distributions become more effectively separated from each other in the end.
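As a small worked example of (14), the snippet below evaluates the feature distribution loss on two hand-picked sets of similarity values. It is only a sketch: the function name is hypothetical, the similarity values are made up for illustration, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev


def feature_distribution_loss(pos_sims, neg_sims):
    """L_feat = -mu_pos + mu_neg + sigma_pos + sigma_neg.

    pos_sims / neg_sims: similarities of same-label and different-label
    source pixel pairs collected from all sliding windows.
    """
    return (-mean(pos_sims) + mean(neg_sims)
            + pstdev(pos_sims) + pstdev(neg_sims))


# Well-separated similarities give a lower loss than overlapping ones.
separated = feature_distribution_loss([0.9, 0.8, 1.0], [0.1, 0.3, 0.2])
overlapping = feature_distribution_loss([0.6, 0.4, 0.8], [0.5, 0.3, 0.7])
```

The first call yields a negative value because the gap between the two means dominates the two small standard deviations, while the second is penalized both for the small gap and for the larger spread.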

IV. EXPERIMENTS

A. Datasets and UDA Settings
We use four public datasets, including ISPRS Potsdam, Vaihingen, SeasonNet [51], and Inria [52], to evaluate the performance of different UDA methods. Some sample images of different UDA settings are shown in Fig. 3.
The Potsdam and Vaihingen datasets consist of aerial images captured over the cities of Potsdam and Vaihingen in Germany. The Potsdam dataset contains 38 images with a size of 6000 × 6000 and a ground sampling distance (GSD) of 5 cm. Potsdam offers both RGB and near-infrared, red, and green (IRRG) images, yet in our experiments, we only use the IRRG ones. The Vaihingen dataset has 33 images with a size of 2000 × 2000 and a GSD of 9 cm. Three IRRG channels are given. Both Potsdam and Vaihingen have six classes.
SeasonNet is a large-scale land cover and land use dataset captured over the whole of Germany. It contains a total of 1 759 830 image patches from the Sentinel-2 sensor, with a patch size of 120 × 120, annotated with 33 land cover classes. All those patches are categorized according to the season in which they were captured. In total, there are four seasons plus an additional "Snow" domain where most of the land cover is covered by snow. This makes it a realistic and ideal setting for evaluating different UDA methods against the temporal and seasonal domain shift.
The Inria dataset is an aerial image labeling dataset created for building footprint extraction [52]. It has a resolution of 0.3 m and a coverage of 810 km² captured over ten European and American urban settlements. Pixel-level annotations of two classes, building and background, are provided in the training set. 36 images with a size of 5000 × 5000 are given for each city.
Based on the above public datasets, we organize four different UDA settings to evaluate the performance of our method. The detailed descriptions are given below.

1) ISPRS Potsdam IRRG to Vaihingen IRRG: In this setting, we consider the Potsdam dataset with IRRG images as the source domain and the Vaihingen dataset as the target domain. We split the images from both datasets into patches of size 1024 × 1024. As the dataset provider gives official training and testing splits of these two datasets, we train the UDA model on the labeled training split of the Potsdam IRRG dataset, as well as the training split of the Vaihingen dataset (without providing the labels), validate the models on the Vaihingen training split, and report the results on the Vaihingen test split.
2) ISPRS Vaihingen IRRG to Potsdam IRRG: This setting is similar to the previous setting, except that we switch the source and the target domain. As a result, we train the UDA model on the training split of Vaihingen, as well as the training split of Potsdam IRRG (without providing the labels), validate the models on the Potsdam IRRG training split, and report the results on the Potsdam IRRG test split.
3) SeasonNet Spring to Fall: To involve temporal and seasonal changes, we consider the spring season as the source domain and the fall season as the target domain. Note that in this case, we want to create a moderate domain shift so that the task is neither too easy nor too hard for UDA methods to take effect. As the dataset provider gives official train, validation, and test splits, we train UDA methods on the train split of the spring season and the validation split of the fall season (without providing labels), validate them on the validation split of the fall season, and report the results on the fall season test split.
4) Inria Intercity: By introducing this setting, we want to evaluate the generalizability of different UDA methods across different geo-locations, i.e., different cities. Since only the five cities in the training set of the Inria dataset are provided with labels, we only utilize these cities. As a result, Austin, Chicago, and Kitsap are considered as the source domain cities, while Vienna and Tyrol-w are considered as the target domain cities. We follow the suggestion from the dataset provider to use the first five images from each city as the validation set and the others as the training set. To this end, we train the UDA methods on the training set of the source domain cities, validate on the target training set, and report the results on the target validation set.

B. Implementation Details
We reimplemented several classic and state-of-the-art UDA methods for evaluation and comparison. These methods include class-balanced self-training (CBST) [14], MaxSqu [53], MinEnt [30], AdvEnt [30], and DAFormer [12]. All the implementations are under the same codebase from MMLab [16] and EarthNets [17], where the data loading pipelines, network architectures, optimizers and training flows are shared, making the comparison fairer. We use a classic network architecture for all the methods, where DeepLabV3+ [54] is used as the decoder and ResNet50 [55] is used as the encoder.
Regarding the data normalization and augmentation, for the SeasonNet spring to fall setting, the data are converted from 16 to 8 bits by cutting out the values beyond µ ± σ for each channel, where µ and σ are the mean and the standard deviation of the whole dataset. The values within µ ± σ are then normalized to [0, 1] and further converted to 8-bit data. For the other settings, data are normalized using the ImageNet statistics [56]. To perform the data augmentation, we include random resizing, cropping, random horizontal and vertical flipping, random rotation of 90°, 180°, or 270°, and random photometric distortion. For DAFormer and PFST, which are based on online pseudo-label generation, these operations are considered as weak augmentation. ClassMix [46], color jittering, and random blurring are used as the strong augmentation. We observe that these data augmentations can largely improve the overall performance of different methods. Regarding the hyperparameter settings of the proposed PFST, α and β in (3) are set as α = 0.1, β = 0.1. The sliding window size is set to 3 with a dilation of 2. Such hyperparameter settings are applied for all the UDA settings.
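The 16-bit to 8-bit conversion described above for SeasonNet can be sketched as follows. The function name and the example statistics are hypothetical, and "cutting out the values beyond µ ± σ" is interpreted here as clipping to the interval, which is an assumption about the preprocessing.

```python
def to_uint8(values, mu, sigma):
    """Clip one channel to [mu - sigma, mu + sigma], rescale to [0, 1],
    and quantize to 8 bits. Clipping is an assumed reading of the text."""
    lo, hi = mu - sigma, mu + sigma
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)
        scaled = (clipped - lo) / (hi - lo)
        out.append(int(round(scaled * 255)))
    return out


# Hypothetical channel statistics, for illustration only.
channel = [0, 900, 950, 1100, 65535]
converted = to_uint8(channel, mu=1000.0, sigma=100.0)
```

Values far outside the interval, such as saturated 16-bit pixels, collapse to 0 or 255, so the 8-bit range is spent on the bulk of the distribution.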
To optimize the networks, the AdamW optimizer [57] with a weight decay of 0.01 and a learning rate of 6e-5 is used to train all the approaches. The batch size is set to 2 for the Potsdam IRRG to Vaihingen IRRG (P2V), Vaihingen IRRG to Potsdam IRRG (V2P), and Inria intercity settings, and to 32 for the SeasonNet spring to fall setting. The number of iterations is set to 40k for all the settings. Polynomial learning rate decay is applied for all the methods. All the experiments are conducted on a single NVIDIA RTX 3090 GPU with the PyTorch library.
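For reference, a polynomial learning rate decay over the 40k iterations can be written as below. The base learning rate and the number of iterations come from the text, whereas the decay power and the final learning rate are not reported, so `power=1.0` and `min_lr=0.0` are placeholder assumptions and the function name is hypothetical.

```python
def poly_lr(step, base_lr=6e-5, max_steps=40_000, power=1.0, min_lr=0.0):
    """Polynomially decayed learning rate at a given iteration.

    `power` and `min_lr` are assumed values, not taken from the paper.
    """
    progress = min(step, max_steps) / max_steps
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr


# Learning rate at the start, the midpoint, and the end of training.
schedule = [poly_lr(s) for s in (0, 20_000, 40_000)]
```

With a power of 1 the schedule is a straight line from the base learning rate to the final one; larger powers would decay faster early in training.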

C. Quantitative Results
We list the quantitative comparison results of different UDA methods on the four settings in Tables I-IV. From Table I, one can observe that all the UDA methods improve over the baseline source-trained model, especially on the foreground objects (categories except "Clutter"). Among all methods, DAFormer and the proposed PFST perform better on the "Clutter" class, which may be owing to the mix-up strategy that balances the distribution of rare classes. PFST performs the best on all categories except "Tree," demonstrating the effectiveness of mining the source domain feature-level similarity in distinguishing both foreground and background objects.
In the ISPRS V2P setting, shown in Table II, we notice that all UDA methods still improve over the baseline. One phenomenon to be noticed is that the variance of their performance on the "Tree" class is larger. One possible reason is that in the Potsdam IRRG dataset, the "Tree" and the "Low Vegetation" classes are hard to adapt and are easily confused with each other at prediction time. The proposed PFST performs the best on these two classes, possibly because the differences between these two types of objects can be better captured in the feature space.
The results of the Inria intercity setting are shown in Table III. Compared to the previous setting, the improvements from different UDA methods are less obvious. Especially on Vienna, methods including CBST, MinEnt, and AdvEnt either cannot outperform the source model or outperform it only slightly. Generally, the domain shift between Vienna and the source domain cities mainly lies in its larger and more complex building geometry, while the shift between Tyrol-w and the source domain cities mainly lies in its color and appearance. This indicates that the geometrywise differences between two domains are more difficult to tackle. In this case, DAFormer and the proposed PFST can still provide stable improvements on both cities, which demonstrates their effectiveness.
The results on the SeasonNet spring to fall setting are shown in Table IV. This is a more challenging setting because seasonal changes usually have large impacts on some of the land cover types that are related to agriculture, forests, natural landscape, and so on. From the per-class results and the average results, one can tell that although DAFormer outperforms the others on some of the classes like C4 and C29, it fails drastically on classes like C30 and C33. This may be because the utilized rare class sampling strategy [12] samples too many repeated image patches from rare classes, resulting in the underfitting of some of the major classes. In general, PFST performs very stably, improves over the baseline source model on almost all the classes, and achieves the highest mean IoU values. This further demonstrates its robustness.

D. Qualitative Results
We visualize the semantic segmentation results of different UDA methods in four different settings in Figs. 4-7. As can be observed in Fig. 4, only DAFormer and the proposed PFST can detect and segment the fine-grained clutter structure (in red color) in the first row. From the third and the fourth rows, all the other UDA methods except PFST perform not very well at distinguishing the differences between "Tree" and "Low Vegetation" categories, e.g., MinEnt, AdvEnt, and DAFormer confuse the large "Low Vegetation" area in the last row with the "Tree" class.
Fig. 5 shows that distinguishing between the "Low Vegetation" and "Tree" classes is still the main challenge. Among all the methods, DAFormer and PFST perform best in this respect, as can be seen in the first, second, and fourth rows. Besides, PFST also detects some tiny clutter objects in the second row, although it still struggles to segment the large clutter area in the third row due to its similarity to the "Impervious Surface" class. Fig. 6 shows the results on the Inria intercity setting. As highlighted by the red bounding boxes, PFST generally performs better at distinguishing building structures that are easily confused with the background areas (as shown in the first and fourth rows), and it also better captures the borders of separate building instances (highlighted in the second row).
The results on the SeasonNet spring to fall setting are given in Fig. 7. Generally, all the methods can capture the overall land cover distribution in the image, yet they still tend to be confused when trying to distinguish between two similar classes. For example, in the first row, only CBST and PFST can distinguish the "Broad-leaved forest" (C16) and the "Coniferous forest" (C17) area highlighted with the red bounding box. In the second row, only PFST can recognize the "Fruit trees and berry plantations" (C14) area inside the bounding box.

Fig. 7. Visualized semantic segmentation results of different UDA methods on SeasonNet spring to fall setting. For the sake of simplicity, we use "C1"-"C33" to denote the class labels. For the actual class names, please refer to [51].

E. Ablation Study
To evaluate whether mining the feature-level local similarity from the source domain model really helps the target model generalize on the target domain, we ablate the proposed local similarity loss and the other components on the ISPRS P2V and V2P settings. As can be seen from Table V, self-training with exponential moving average plays an important and fundamental role in setting a strong baseline for our method, resulting in around 10% and 6% performance improvements on these two settings, respectively. If we apply strong augmentations to the student model branch during self-training, we see a further performance increase on the P2V setting, although the influence is less obvious on the V2P setting. The proposed local similarity loss further boosts the performance of the UDA method on both settings by relatively large margins (more than 2% and 3%) on top of an already very strong baseline. For a nonparametric component, this result is promising and supports the effectiveness of leveraging feature-level local relation.
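The exponential moving average component of the ablation can be illustrated with a minimal sketch of the usual mean-teacher style update, in which the teacher weights slowly track the student weights. The function name `ema_update`, the dictionary-of-lists parameter layout, and the default momentum of 0.99 are assumptions made for illustration; the configuration actually used in the experiments is not restated here.

```python
def ema_update(teacher, student, momentum=0.99):
    """Update teacher parameters in place as an exponential moving average of the student.

    teacher, student: dicts mapping a parameter name to a list of floats
    (a stand-in for real model tensors; hypothetical layout).
    momentum: weight kept from the old teacher value (assumed default).
    """
    for name, student_values in student.items():
        teacher_values = teacher[name]
        for k, s in enumerate(student_values):
            # new_teacher = momentum * old_teacher + (1 - momentum) * student
            teacher_values[k] = momentum * teacher_values[k] + (1.0 - momentum) * s
    return teacher
```

With a momentum close to 1, the teacher changes slowly, so the pseudo labels it produces are less affected by noisy individual student updates.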

F. Interpreting the Local Feature Relation
To better explain the effectiveness of exploiting local relation, some verifying experimental results are presented in this section. Considering each local region defined by $i$ that is sliding over the target image $x^t$, we assume that the center pixel $x^t_i$ is correctly classified by the semantic segmentation model, i.e., $\arg\max f_\theta(x^t)_i = y^t_i$. To this end, by investigating whether the pseudo labels of $x^t_i$ and its neighborhood $x^t_j$ are of the same class or not, there will be two cases.
1) $\arg\max f_\theta(x^t)_i = \arg\max f_\theta(x^t)_j$: In this case, the two neighboring pixels have the same pseudo labels. There will be another two subcases depending on whether the prediction on $x^t_j$ is correct or not, i.e., $\arg\max f_\theta(x^t)_j = y^t_j$ or $\arg\max f_\theta(x^t)_j \neq y^t_j$. We denote these two subcases as Case 1a and Case 1b.
2) $\arg\max f_\theta(x^t)_i \neq \arg\max f_\theta(x^t)_j$: In this case, the two neighboring pixels have different pseudo labels. Similarly, there will be two subcases according to whether there is $\arg\max f_\theta(x^t)_j = y^t_j$ or $\arg\max f_\theta(x^t)_j \neq y^t_j$. These two subcases are denoted as Case 2a and Case 2b.

Fig. 8. Per-class feature similarity distribution for all the sliding windows on ISPRS P2V setting. The class is defined according to the label of the center pixel. For most of the classes, one can observe that the feature similarity of Case 1a and Case 2a are generally larger than the similarity of Case 1b and Case 2b, indicating that the similarity between local pseudo features are more accurate than the pseudo labels in revealing the true relation between neighboring target outputs.
With the cases listed above, we seek to verify the assumption that the pairwise feature similarities are more likely than the pseudo labels to reveal the true relationship between each pair of neighboring pixels. In both Case 1a and Case 1b, the pseudo labels give the same prediction to the neighboring pixels, yet these predictions are correct in Case 1a and incorrect in Case 1b. Hence, our assumption is supported if the pairwise similarity values in Case 1a are statistically larger than those in Case 1b. Likewise, since the pseudo labels give different predictions to the neighboring pixels in both Case 2a and Case 2b, our assumption is supported if the similarity values in Case 2a are larger than those in Case 2b. As shown in Fig. 8, the expected phenomena can be observed, which verifies our assumption.
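The case analysis above can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is assumed as the pairwise feature similarity, pixels are given as flat index lists, and the names `assign_case` and `similarity_by_case` are hypothetical rather than taken from the released code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def assign_case(pseudo_i, pseudo_j, true_i, true_j):
    """Return '1a', '1b', '2a', or '2b'; None if the center pixel i is misclassified."""
    if pseudo_i != true_i:
        return None  # the analysis assumes a correctly classified center pixel
    same_pseudo = pseudo_i == pseudo_j      # Case 1 if True, Case 2 otherwise
    neighbor_correct = pseudo_j == true_j   # subcase a if True, subcase b otherwise
    prefix = "1" if same_pseudo else "2"
    return prefix + ("a" if neighbor_correct else "b")


def similarity_by_case(features, pseudo, labels, pairs):
    """Group feature similarities of (center, neighbor) index pairs by case."""
    groups = {"1a": [], "1b": [], "2a": [], "2b": []}
    for i, j in pairs:
        case = assign_case(pseudo[i], pseudo[j], labels[i], labels[j])
        if case is not None:
            groups[case].append(cosine_similarity(features[i], features[j]))
    return groups
```

Comparing the distributions in `groups["1a"]` against `groups["1b"]`, and `groups["2a"]` against `groups["2b"]`, corresponds to the comparison visualized per class in Fig. 8.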

G. Limitations and Failure Cases
To analyze the limitations of the proposed method, we present some of its failure cases in Fig. 9.
In the first row, it can be seen that PFST fails to recognize the central basketball field, which is a rare class that is not well represented in the training set, and misclassifies it as the "Clutter" class. This suggests that the proposed local spatial layout (LSL) method may not be effective in recognizing out-of-distribution targets. In the second and third rows, PFST misclassifies the central "Impervious Surface" and the upper "Building" area, possibly due to the lack of spatial context. In such scenarios, the proposed method may not provide significant improvement. In the last row, PFST misclassifies the upper "Pastures" area as "Vineyards", which has a similar appearance, indicating that the high-dimensional feature-level relation may not be sufficient to distinguish targets that exhibit only subtle differences.
These failure cases highlight the limitations of the proposed method in handling rare classes, lack of spatial context, and subtle differences in appearance, indicating areas where further improvements may be needed.

V. CONCLUSION
In this article, we observe that, for UDA, domain-invariant knowledge can be better preserved within high-dimensional feature-wise topological relations than within output-space pseudo labels. Inspired by this, a novel self-training mechanism is developed to regularize target outputs using the local relation within the source feature space. The proposed method is evaluated on four standard UDA settings, and the results show that it achieves superior performance compared to the existing UDA methods.
While the proposed method has shown success in general cases, its performance may degrade when dealing with limited spatial context or out-of-distribution targets. Additional research is required to overcome these challenges and improve the method's robustness and effectiveness in such scenarios.