SAM-driven MAE Pre-training and Background-aware Meta-learning for Unsupervised Vehicle Re-identification

Distinguishing identity-unrelated background information from discriminative identity information poses a challenge in unsupervised vehicle re-identification (Re-ID) tasks. Additionally, Re-ID models suffer from varying degrees of background interference caused by continuous scene variations. The recently proposed segment anything model (SAM) has demonstrated exceptional performance in zero-shot segmentation tasks. Combining SAM with vehicle Re-ID models enables the efficient separation of vehicle identity and background information. This paper proposes a method that combines SAM-driven masked autoencoder (MAE) pre-training and background-aware meta-learning for unsupervised vehicle Re-ID. The method consists of three sub-modules. First, the segmentation capacity of SAM is utilized to separate the vehicle identity region from the background region. Given that SAM cannot be robustly employed in exceptional situations, such as ambiguity and occlusion, in vehicle Re-ID downstream tasks, a space-constrained vehicle background segmentation method is presented to obtain accurate background segmentation results. Second, SAM-driven MAE pre-training is designed. It utilizes the aforementioned segmentation results to select patches that belong to the vehicle and mask all other patches, allowing MAE to learn identity-sensitive features in a self-supervised manner. Finally, a background-aware meta-learning method is developed to fit varying degrees of background interference under different scenarios by combining different background region ratios. Extensive experiments confirm that the proposed method demonstrates state-of-the-art performance in reducing background interference variations.


Introduction
Vehicle re-identification (Re-ID) aims to match specific vehicle targets across a multi-camera system by feature similarity [1][2][3]. Previous studies [4][5][6] have found that background information limits the capability of Re-ID models to distinguish identity information, especially in unsupervised vehicle Re-ID tasks that lack annotations. Vehicles with the same identity appear with varying degrees of background information in different surveillance scenarios, which makes Re-ID models sensitive to background variations. Thus, background variation greatly limits the deployment of vehicle Re-ID and poses several challenges.
The primary challenge is that vehicle Re-ID models struggle to distinguish identity-unrelated background information from discriminative identity information. Existing methods [7,8] focus on removing background interference during training, thereby enhancing the sensitivity of Re-ID models to identity information. However, the identity and background information contained in vehicle images do not exist independently but are spatially interdependent. Therefore, directly removing background information may cause the learned features to lose high-dimensional spatial information, thereby reducing the robustness of Re-ID models to background variations. Recently, masked autoencoders (MAE; [9]) have been applied in vision pre-training tasks. MAE randomly masks patches of the training images and reconstructs the masked patches from the encodings of the unmasked ones, prompting the model to learn information related to the unmasked patches. However, MAE cannot be efficiently applied to downstream vehicle Re-ID tasks. The random masking strategy exacerbates background interference in downstream Re-ID models because it may discard vehicle patches that carry identity information while retaining identity-unrelated background patches. Inspired by the segment anything model (SAM; [10]), this work aims to obtain high-quality background segmentation results through low-cost prompt engineering and to use them as a guide for MAE to selectively preserve identity patches.
The second challenge is how to make Re-ID models adapt to varying degrees of background interference caused by different scene variations. Many researchers [11,12] have regarded different background information as different domains and used cross-domain transfer techniques to align different degrees of background interference. This type of method requires multiple style transfers of samples across scenarios, so it is difficult to apply to large-scale datasets, such as VeRi-Wild [13]. Recently, meta-learning-based methods [14][15][16] have achieved strong results on domain generalization problems in Re-ID tasks. We argue that treating different degrees of background interference as different domains allows meta-learning to help the model adapt to background changes. The objective of this paper is to explore a background-aware meta-learning strategy that utilizes the background region ratio of vehicle images so that the Re-ID model can adapt to varying degrees of background interference.
SAM-driven MAE pre-training and a background-aware meta-learning method are developed to overcome the aforementioned challenges. Experiments confirm the effectiveness of the proposed method on two publicly available datasets (i.e., VeRi-776 [17] and VeRi-Wild [13]). The main contributions of this work are summarized as follows:
(1) To ensure the robustness of SAM in performing zero-shot segmentation tasks on the vehicle Re-ID dataset, this paper proposes a space-constrained vehicle background segmentation method that optimizes the background segmentation results by introducing a simple visual encoder alongside SAM to mine the spatial relationship between the vehicle and background regions.
(2) SAM-driven MAE pre-training is proposed to enable downstream Re-ID models to learn background-unrelated identity features. Specifically, MAE is guided to selectively encode the vehicle patches by analyzing the input samples and the optimized segmentation results. Then, through decoder reconstruction, the encoder indirectly learns vehicle context information related to the unmasked patches.
(3) A background-aware meta-learning method is designed to make the Re-ID model adapt to varying degrees of background interference on the basis of different background region ratios.

Unsupervised Vehicle Re-ID
The unsupervised vehicle Re-ID task aims at mining vehicle identity information without labeled annotations. Existing methods employ clustering-based pseudo labels as supervision to optimize the whole unsupervised training process. Some researchers [18][19][20] have improved the pseudo-label generation method to improve the performance of Re-ID. Yu et al. [18] maintained a global feature dictionary and considered the similarity between samples from three aspects based on the feature dictionary to obtain more reliable identity information than density-based clustering. Lu et al. [19] considered that using only global features to generate pseudo labels is unreliable and therefore used multi-view vehicle features to improve the identifiability of the feature representation and eliminate label noise. Unsupervised learning from scratch is difficult for models, so some researchers [21][22][23] have utilized unsupervised domain adaptation (UDA) methods to enable Re-ID models to learn identity-distinguishing features from unlabeled images. Dai et al. [22] proposed a dynamic task-oriented disentanglement network (DTDN), which narrows the domain gap by establishing task-relevant relationships and eliminating task-irrelevant relationships between the target and source domains. Wei et al. [23] proposed a Transformer-based domain encoder, which directly introduces domain information into the network to generate more robust domain-specific feature representations. Recently, MAE has been proposed for pre-training in a self-supervised manner and achieves strong performance in various downstream tasks. Inspired by MAE, our motivation is to explore a robust MAE pre-training method suitable for downstream unsupervised vehicle Re-ID tasks.

Background Segmentation
Accurately segmenting the main objects and background elements in a given image is crucial in computer vision. To enable segmentation models to segment specific objects, some researchers [24][25][26] consider that the model should be provided with certain prompt information as guidance. Wu et al. [24] proposed a hierarchical modular attention network (HULANet) to achieve distribution alignment of text and image prediction through a text-description-driven attention mechanism. Xie et al. [26] used natural language and image features to jointly constrain the predicted object region, achieving more accurate segmentation results by establishing connections between the object, background, and text. With the popularity of large-scale training in computer vision, large segmentation models have also been proposed, such as SAM and SegGPT [27]. Among them, SAM achieved impressive zero-shot performance by building a three-stage data engine and training on more than 1B masks. The powerful segmentation performance of SAM can easily be transferred to the background segmentation task in vehicle images. However, difficulties such as occlusion and blurring in vehicle Re-ID tasks may cause SAM to produce incorrect segmentation results. Therefore, how to enable SAM to provide more accurate background and identity segmentation information in the vehicle Re-ID task is also a key issue considered in this paper.

Background-based Vehicle Re-ID
Because vehicle Re-ID is a cross-scene image retrieval task, vehicles with the same identity may suffer varying degrees of background interference in different scenes. Some works [4,28,29] hold that background interference should be eliminated before the image is input into the network. Peng et al. [28] proposed a cross-camera adaptation framework (CCA), which utilizes StarGAN to perform camera-style transfer on the dataset and reduce the impact of background information on identity feature learning. Khoramshahi et al. [29] computed the difference between the original image and a coarse image without fine-grained information generated by a variational autoencoder to obtain a vehicle image that removes background interference and highlights salient information. Recently, some new methods [5,6,30] have achieved excellent performance in separating background interference at the feature level. Lu et al. [6] extracted background-unrelated global features by jointly considering token features of the original image and semantic features based on vehicle masks. Zhu et al. [30] took the difference between the global feature similarity and the camera-ID-based background feature similarity during the similarity measurement phase to eliminate the similarity bias caused by background information. The aforementioned methods only reduce background interference in the retrieval process by filtering background regions, without considering the varying degrees of interference in different scenarios. The purpose of this paper is to design a novel meta-learning method that allows the Re-ID model to adapt to various degrees of background interference.

Overview
This section designs a SAM-driven MAE and background-aware meta-learning method for unsupervised vehicle Re-ID. The overall workflow of the proposed method consists of three modules, namely, space-constrained vehicle background segmentation, SAM-driven MAE pre-training, and background-aware meta-learning for unsupervised vehicle Re-ID, as presented in Fig. 1. In the first module, all unlabeled training samples are fed to SAM to obtain preliminary background segmentation results. Considering the unstable segmentation performance of SAM on occluded and blurred samples, we calculate space-constrained scores to optimize all segmentation results. In the SAM-driven MAE pre-training module, the optimized segmentation results are used as a guide to randomly preserve identity patches. The whole pre-training process is conducted in a self-supervised manner; the masked image with some preserved patches is encoded using $E_{MAE}$, and the resulting encodings are decoded using $D_{MAE}$ to reconstruct the image. Pre-training loss $L_{MSE}$ is utilized to ensure the quality of the reconstructed images. In the downstream unsupervised vehicle Re-ID task of the third module, encoder $E_{MAE}$ serves as the baseline that extracts features from the training set, which are input into DBSCAN [31] for clustering to obtain the corresponding pseudo labels. Subsequently, the training set is dynamically divided into meta-train and meta-test sets on the basis of the range of the background region ratio. The whole meta-learning process utilizes the parameters of meta-train model $E_{TR}$ as the initial parameters of meta-test model $E_{TE}$ and is supervised by losses $L_{TR}$ and $L_{TE}$.

Space-constrained Vehicle Background Segmentation
With the rise of SAM, zero-shot segmentation has become available through low-cost prompts, such as bounding boxes, without the need to train a specific segmentation model on a particular dataset. However, SAM cannot directly obtain precise segmentation results because of the low resolution, blurriness, and occlusion in the vehicle Re-ID dataset. Thus, a space-constrained vehicle background segmentation method is proposed in this paper to provide precise patch-based segmentation results for downstream Re-ID tasks. SAM is employed to roughly divide all patches of the image into vehicle identity and background regions, and the segmentation results are then constrained and optimized by considering the spatial correlation between the two regions. The detailed process of the proposed method is shown in Fig. 2.
First, pixel-level background segmentation mask $Mask_{SAM} \in \mathbb{R}^{H \times W}$ is obtained by inputting original image $I \in \mathbb{R}^{H \times W \times 3}$ and the corresponding bounding box prompt into SAM. Second, the division rule of patches is defined to obtain patch-based mask $Mask^{p}_{SAM} \in \mathbb{R}^{\frac{H}{P} \times \frac{W}{P}}$ (i.e., when more than half of the pixels in a patch are located in the vehicle identity region, the patch is considered a vehicle identity patch). To effectively mine the spatial correlation between patches, we extract the feature set $f_G \in \mathbb{R}^{N \times D}$ of all patches from original image $I$, where $N = (H \times W)/P^2$ represents the total number of patches and $D$ is the dimension of the feature. Based on the patch-level segmentation labels of $Mask^{p}_{SAM}$, $f_G$ can be divided into background-patch feature set $f_B$ and vehicle-patch feature set $f_V$. We compute the cosine similarity between two patch feature sets to obtain similarity matrices $M_{V-V}$, $M_{B-V}$, and $M_{B-B}$, which can be formulated as Eq. 1:
$$M_{X-Y} = \cos\left(f_X, \mathrm{trans}(f_Y)\right) \in \mathbb{R}^{N_X \times N_Y}, \quad X, Y \in \{V, B\} \quad (1)$$
where $V$ and $B$ refer to the vehicle and background, respectively; $M_{X-Y}$ represents the similarity matrix between the $X$- and $Y$-patch feature sets; $\cos(\cdot)$ and $\mathrm{trans}(\cdot)$ refer to the cosine similarity calculation and matrix transpose operation, respectively; and $N_V$ and $N_B$ are the patch numbers of the vehicle and background, respectively.
These similarity matrices are averaged along the column dimension to obtain $S_{V-V}$, $S_{B-V}$, $S_{V-B}$, and $S_{B-B}$. $S_{V-B}$ is obtained by the same operation after transposing matrix $M_{B-V}$. $S_{X-Y}$ indicates the proxy similarity score between each element in the $X$-patch feature set and the entire $Y$-patch feature set. On the basis of the four proxy similarity scores, each patch in the image is compared using its similarity to the vehicle and background regions to determine which region it should belong to. Score $S_X$ is calculated by subtracting $S_{X-B}$ from $S_{X-V}$ to facilitate a score comparison. $S_V$ and $S_B$ are merged in the original patch order to obtain space-constrained scores $S$. The detailed calculation process is expressed as Eq. 2:
$$S_V = S_{V-V} - S_{V-B}, \quad S_B = S_{B-V} - S_{B-B}, \quad S = \mathrm{merge}(S_V, S_B) \quad (2)$$
The sign of each value in $S$ then determines which region each patch should belong to. Specifically, when the space-constrained score $S_i$ of the $i$-th patch is greater than 0, the patch is considered a vehicle patch; when $S_i$ is less than 0, the patch is considered a background patch; and when $S_i$ is equal to 0, the patch is a noise patch that is equally similar to the vehicle and background regions and is treated as a background patch. Through this processing, optimized patch-based mask $Mask^{p}_{op}$ is obtained, and it provides precise patch-based background segmentation information.
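The refinement described by Eqs. 1 and 2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `eps` guard, and the fallback used when one region is empty are assumptions made for illustration, and the patch features are toy 2-D vectors rather than encoder outputs.

```python
import numpy as np

def pixel_mask_to_patch_mask(mask, patch):
    """Majority rule: a patch is a vehicle patch if more than half its pixels are."""
    h, w = mask.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    # Axes after reshape: (patch row, pixel row in patch, patch col, pixel col in patch).
    blocks = mask.reshape(h // patch, patch, w // patch, patch).astype(float)
    return blocks.mean(axis=(1, 3)) > 0.5

def cosine_similarity_matrix(a, b, eps=1e-12):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T

def space_constrained_scores(features, is_vehicle):
    """features: (N, D) patch features; is_vehicle: (N,) bool labels from SAM.
    Returns the scores S and the refined vehicle mask (S > 0)."""
    f_v, f_b = features[is_vehicle], features[~is_vehicle]
    if len(f_v) == 0 or len(f_b) == 0:
        # Assumed fallback: with one region empty there is nothing to compare.
        return np.zeros(len(features)), is_vehicle.copy()
    m_vv = cosine_similarity_matrix(f_v, f_v)   # (N_V, N_V)
    m_bv = cosine_similarity_matrix(f_b, f_v)   # (N_B, N_V)
    m_bb = cosine_similarity_matrix(f_b, f_b)   # (N_B, N_B)
    s_vv = m_vv.mean(axis=1)
    s_bv = m_bv.mean(axis=1)
    s_vb = m_bv.T.mean(axis=1)                  # transpose first, then average
    s_bb = m_bb.mean(axis=1)
    scores = np.empty(len(features))
    scores[is_vehicle] = s_vv - s_vb            # S_V = S_{V-V} - S_{V-B}
    scores[~is_vehicle] = s_bv - s_bb           # S_B = S_{B-V} - S_{B-B}
    return scores, scores > 0                   # a score of 0 falls to background

# Toy pixel mask: 4x4 image, 2x2 patches.
pixel_mask = np.zeros((4, 4), dtype=bool)
pixel_mask[0:2, 0:2] = True                     # top-left patch: all vehicle
pixel_mask[0, 2:4] = True                       # top-right patch: exactly half
pixel_mask[2:4, 2:4] = True
pixel_mask[3, 3] = False                        # bottom-right patch: 3 of 4
patch_mask = pixel_mask_to_patch_mask(pixel_mask, 2)

# Toy features: the fourth patch looks like background but SAM labelled it vehicle.
features = np.array([[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0],
                     [0.0, 1.0], [0.1, 1.0], [0.0, 0.9]])
sam_labels = np.array([True, True, True, True, False, False, False])
scores, refined = space_constrained_scores(features, sam_labels)
```

In the toy example, the mislabelled fourth patch receives a negative score and is moved to the background region, while the other labels are unchanged.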

SAM-driven MAE Pre-training
Learning robust identity representations is crucial for unsupervised vehicle Re-ID tasks. However, existing unsupervised Re-ID models cannot easily separate identity-unrelated background information during the representation learning process. The main reason is that most models increasingly focus on erroneous background information during each training iteration. Enhancing the sensitivity of Re-ID models to discriminative identity information is the key to solving the abovementioned problem. This section designs a SAM-driven MAE pre-training method that enhances feature extraction of vehicle identity regions through a SAM-guided pre-trained model based on the MAE architecture. The pre-trained model is highly sensitive to vital vehicle identity information in downstream tasks, and its detailed process is illustrated in Fig. 3.
In the pre-training encoding step, given image $I$ is divided into $N$ patches of size $P \times P$, and patch embedding is performed. If the obtained embeddings were directly input into MAE, its random masking strategy would give all patches the same probability of being preserved. When a background patch is preserved, the encoder may learn background-related interference information. This paper optimizes the original random masking strategy of MAE by using the optimized patch-based mask $Mask^{p}_{op}$ obtained in Section 3.2 as guide information to randomly preserve part of the vehicle patches. Given that the number of vehicle patches is $N'_V$, the preserved ratio is set to $\gamma$. Our masking strategy selects a total of $N'_V \times \gamma$ patches in each iteration and inputs the preserved patches into encoder $E_{MAE}$ to obtain the corresponding encodings.
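The SAM-guided masking step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `select_preserved_patches` is hypothetical, the product $N'_V \times \gamma$ is rounded down, and at least one vehicle patch is kept when any exists; the text does not specify these details.

```python
import random
from typing import List, Optional

def select_preserved_patches(vehicle_mask: List[bool], gamma: float,
                             rng: Optional[random.Random] = None) -> List[int]:
    """Return the sorted indices of the vehicle patches kept for the encoder.

    vehicle_mask: flattened optimized patch mask (True = vehicle patch).
    gamma: preserved ratio, applied to the number of vehicle patches only.
    """
    rng = rng or random.Random()
    vehicle_idx = [i for i, is_vehicle in enumerate(vehicle_mask) if is_vehicle]
    if not vehicle_idx:
        return []
    # Assumed rounding rule: floor, but never fewer than one patch.
    n_keep = max(1, int(len(vehicle_idx) * gamma))
    return sorted(rng.sample(vehicle_idx, n_keep))

# Toy 4x4 patch grid with 8 vehicle patches; background patches are never kept.
vehicle_mask = [False, False, False, False,
                False, True,  True,  False,
                True,  True,  True,  True,
                False, True,  True,  False]
kept = select_preserved_patches(vehicle_mask, gamma=0.25, rng=random.Random(0))
masked = [i for i in range(len(vehicle_mask)) if i not in kept]
```

Every index outside `kept`, including all background patches, is replaced by the mask token at decoding time, so the encoder only ever sees vehicle patches.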
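In the decoding step, the preserved encodings are restored to the positions of the corresponding patches. The positions of the previously masked patches are filled with a shared learnable mask token. After positional embeddings are added, decoder $D_{MAE}$ reconstructs the image, and the pre-training loss $L_{MSE}$ is computed as Eq. 3:
$$L_{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(I_i - I_i^{rec}\right)^2 \quad (3)$$
where $I_i$ and $I_i^{rec}$ are the $i$-th pixels of the original and reconstructed images, respectively, and $n$ is the number of pixels.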
In the whole self-supervised pre-training process, decoder $D_{MAE}$ performs contextual semantic inference with high correlation on the basis of the given vehicle patch encodings, and the ground truth continuously corrects the reconstruction results. This process makes encoder $E_{MAE}$ sensitive to vehicle identity information, thus providing a robust pre-trained model that distinguishes between identity and identity-unrelated background information for downstream unsupervised vehicle Re-ID tasks.

Background-aware Meta-learning for Unsupervised Vehicle Re-ID
Although existing unsupervised vehicle Re-ID methods have impressive performance, they still suffer from varying degrees of background interference caused by scene variations. The region ratio of the vehicle body of the same identity in 2D pixel space varies because of varying degrees of background interference. The limited robustness of unsupervised Re-ID models to background variations leads to considerable differences in intraclass features, thereby reducing the accuracy of feature learning. This paper proposes a background-aware meta-learning approach that splits the original training set into meta-train and meta-test sets in accordance with varying background interference. The degree of background interference is approximated by calculating the ratio of the background region in each image. The proposed meta-learning learns background-invariant features, and it consists of four steps: meta-set split, meta-train, meta-test, and meta-optimize.
Meta-set split. Given training set $U$, this paper uses DBSCAN to generate pseudo labels for it. To make the Re-ID model adapt to different degrees of background interference, we simulate completely different background interference distributions in the meta-train and meta-test sets. On the basis of the optimized patch-based mask $Mask^{p}_{op}$ (obtained in Section 3.2) of all images, the ratio of the background region $r \in (0, 1)$ in the corresponding image is computed. The range of the background region ratio is divided evenly into 10 intervals (each of width 0.1), and all images in the training set are divided into 10 subsets depending on which interval $r$ falls in.
As shown in Fig. 4(a), the background region ratio of most of the images in the vehicle Re-ID dataset is concentrated in a few intervals. Direct random division based on the intervals may result in an extremely unbalanced number of images in the meta-train and meta-test sets. A balanced split strategy is adopted in this paper, as shown in Fig. 4(b): First, the two subsets with the largest number of images are randomly split between the meta-train and meta-test sets. Second, the other subsets are randomly assigned to the meta-train set one by one until the number of images in the meta-train set exceeds half of the total number of images. Last, all remaining subsets are allocated to the meta-test set.
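The interval binning and the balanced split can be sketched in plain Python. The sketch reads the first step as assigning one of the two largest subsets to each meta-set at random; this reading, the function names, and the toy ratios are assumptions for illustration.

```python
import random
from collections import defaultdict

def bin_by_background_ratio(ratios, n_bins=10):
    """Group image indices by the interval their background ratio r falls in."""
    bins = defaultdict(list)
    for idx, r in enumerate(ratios):
        bins[min(int(r * n_bins), n_bins - 1)].append(idx)
    return dict(bins)

def balanced_meta_split(bins, rng):
    """Return (meta-train bin ids, meta-test bin ids) following the balanced strategy."""
    if len(bins) < 2:
        raise ValueError("need at least two non-empty intervals to split")
    total = sum(len(images) for images in bins.values())
    order = sorted(bins, key=lambda b: len(bins[b]), reverse=True)
    largest_two, rest = order[:2], order[2:]
    # Step 1 (assumed reading): one of the two largest subsets goes to each meta-set.
    rng.shuffle(largest_two)
    train_bins, test_bins = [largest_two[0]], [largest_two[1]]
    n_train = len(bins[largest_two[0]])
    # Step 2: add the other subsets to meta-train in random order until it
    # holds more than half of all images. Step 3: the remainder goes to meta-test.
    rng.shuffle(rest)
    for b in rest:
        if n_train > total / 2:
            test_bins.append(b)
        else:
            train_bins.append(b)
            n_train += len(bins[b])
    return train_bins, test_bins

# Toy ratios concentrated in the intervals [0.2, 0.3) and [0.3, 0.4).
ratios = [0.25] * 40 + [0.35] * 30 + [0.05] * 5 + [0.15] * 10 + [0.45] * 10 + [0.55] * 5
bins = bin_by_background_ratio(ratios)
train_bins, test_bins = balanced_meta_split(bins, random.Random(0))
meta_train = [i for b in train_bins for i in bins[b]]
meta_test = [i for b in test_bins for i in bins[b]]
```

Because whole intervals are assigned to one side, the meta-train and meta-test sets never share a background-ratio interval, which is what produces the different interference distributions.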
Meta-train. In the meta-train step, encoder $E_{TR}$ is initialized with the parameters of the pre-trained model $E_{MAE}$ from Section 3.3, and images sampled from the meta-train set are fed to $E_{TR}$ to compute the meta-train loss. The proposed method uses triplet loss $L_{Tri}$ and cross-entropy loss $L_{CE}$ with label smoothing as total loss $L_{TR}$ at the meta-train stage to improve model performance. The computation process can be formulated as Eq. 4:
$$L_{Tri} = \max(d_p - d_n + \alpha, 0), \quad L_{CE} = -\sum_{i} \tilde{y}_i \log q_i, \quad L_{TR} = L_{Tri} + L_{CE} \quad (4)$$
where $d_p$ and $d_n$ represent the distances of positive and negative sample pairs in the mini-batch, respectively, and $\alpha$ is the margin of the triplet loss. $\tilde{y}_i = \beta y_i + (1 - \beta) v$ represents label smoothing with constant $\beta$ for pseudo-label $y_i$, $v$ is a uniform vector, and $q_i$ is the classification prediction for the image.
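The two loss terms can be written out in a short NumPy sketch. It assumes that $v$ is the uniform distribution $1/C$ over $C$ pseudo-classes and that both losses are averaged over the mini-batch; neither detail is stated in the text, and the function names are illustrative.

```python
import numpy as np

def triplet_loss(d_pos, d_neg, margin):
    """Hinge on d_p - d_n + alpha, averaged over the batch (averaging is assumed)."""
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

def smoothed_cross_entropy(logits, labels, beta):
    """Cross-entropy against targets beta * one_hot + (1 - beta) * v, with v = 1/C."""
    n, c = logits.shape
    targets = beta * np.eye(c)[labels] + (1.0 - beta) / c
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_q).sum(axis=1).mean())

def meta_train_loss(d_pos, d_neg, logits, labels, margin=0.3, beta=0.9):
    """L_TR = L_Tri + L_CE; the default margin and beta are placeholder values."""
    return triplet_loss(d_pos, d_neg, margin) + smoothed_cross_entropy(logits, labels, beta)

# Toy batch: the first triplet already satisfies the margin, the second does not.
d_pos = np.array([0.2, 1.0])
d_neg = np.array([1.0, 0.5])
logits = np.zeros((2, 4))          # uniform predictions over 4 pseudo-classes
labels = np.array([0, 3])
l_tri = triplet_loss(d_pos, d_neg, margin=0.3)
l_ce = smoothed_cross_entropy(logits, labels, beta=0.9)
l_tr = meta_train_loss(d_pos, d_neg, logits, labels)
```

With uniform predictions the smoothed cross-entropy equals $\log C$ for any $\beta$, which gives a quick sanity check on the implementation.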
Meta-test. In the meta-test step, parameters $\theta_{TR}$ of $E_{TR}$ are used to construct a temporary model $E_{TE}$ with meta-train loss $L_{TR}$. Parameters $\theta_{TE}$ of $E_{TE}$ can be obtained from Eq. 5:
$$\theta_{TE} = \theta_{TR} - lr \cdot \nabla_{\theta_{TR}} L_{TR} \quad (5)$$
where $lr$ is the learning rate. Then, $E_{TE}$ is employed to calculate meta-test loss $L_{TE}$ for the images sampled from the meta-test set, similar to Eq. 4.

Meta-optimize.
Overall optimization of the model can be achieved based on the learning and adaptation of the model to different background interference tasks in meta-train and meta-test flow.The total loss and overall model parameter updates are shown in Eq. 6 and Eq. 7, respectively.
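One meta-train, meta-test, and meta-optimize iteration can be illustrated on a toy problem. The sketch below replaces the Re-ID encoder with a linear model and the Re-ID losses with a mean squared error, takes the total loss to be the unweighted sum $L_{TR} + L_{TE}$, and uses a first-order approximation of the meta-gradient. All of these are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def loss_and_grad(theta, x, y):
    """Mean squared error of a linear model and its gradient (stand-in for Eq. 4)."""
    err = x @ theta - y
    return float((err ** 2).mean()), 2.0 * x.T @ err / len(y)

def meta_step(theta, train_batch, test_batch, lr):
    """Run one meta-iteration and return (updated parameters, total loss)."""
    # Meta-train: loss and gradient on the meta-train batch.
    loss_tr, grad_tr = loss_and_grad(theta, *train_batch)
    # Meta-test: temporary parameters from one inner step (Eq. 5),
    # then the loss on the meta-test batch.
    theta_te = theta - lr * grad_tr
    loss_te, grad_te = loss_and_grad(theta_te, *test_batch)
    # Meta-optimize: assumed total loss L_TR + L_TE. The gradient of L_TE is
    # taken at theta_te and applied to theta (first-order approximation).
    new_theta = theta - lr * (grad_tr + grad_te)
    return new_theta, loss_tr + loss_te

rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
x_train, x_test = rng.normal(size=(32, 3)), rng.normal(size=(32, 3))
train_batch = (x_train, x_train @ true_theta)
test_batch = (x_test, x_test @ true_theta)

theta = np.zeros(3)
losses = []
for _ in range(200):
    theta, total = meta_step(theta, train_batch, test_batch, lr=0.05)
    losses.append(total)
```

In the actual method, the two batches would be drawn from the meta-train and meta-test sets, so that an update is rewarded only if it also helps under a different background-interference distribution.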
The aforementioned process constructs meta-train and meta-test tasks with different degrees of background interference. It continuously motivates the Re-ID model to adapt to different degrees of background interference during iterative training and to learn more robust background-invariant features.
Evaluation Protocols. Mean average precision (mAP) and cumulative matching characteristics (CMC) are employed to evaluate the performance of unsupervised vehicle Re-ID methods. mAP is a widely used evaluation metric in retrieval and detection tasks. It averages, over all queries, the average precision of each ranked result list and thus balances precision and recall. CMC, on the other hand, focuses on the ranking-based performance of the model. It measures whether a correct match appears among the top K results for a given query image. In the experiments, mAP, Rank-1, and Rank-5 are calculated to compare the performance of the evaluated methods.
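The two metrics can be computed from a query-gallery distance matrix as in the following sketch. It is a simplified illustration: the function name `evaluate` is hypothetical, and the filtering of same-camera gallery images that Re-ID benchmarks usually apply is omitted.

```python
import numpy as np

def evaluate(dist, query_ids, gallery_ids, ranks=(1, 5)):
    """dist: (Q, G) distance matrix. Returns (mAP, {k: Rank-k accuracy})."""
    aps, hits, valid = [], {k: 0 for k in ranks}, 0
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                  # gallery sorted by distance
        matches = gallery_ids[order] == query_ids[q]
        if not matches.any():
            continue                                 # query has no true match
        valid += 1
        hit_pos = np.flatnonzero(matches)            # 0-based ranks of true matches
        # Precision at each true match: (matches found so far) / (rank position).
        precisions = np.arange(1, len(hit_pos) + 1) / (hit_pos + 1)
        aps.append(precisions.mean())
        for k in ranks:
            hits[k] += int(hit_pos[0] < k)           # CMC: first match within top k
    if valid == 0:
        raise ValueError("no query has a matching gallery image")
    return float(np.mean(aps)), {k: hits[k] / valid for k in ranks}

# Toy example: 2 queries, 4 gallery images.
query_ids = np.array([0, 1])
gallery_ids = np.array([0, 1, 0, 1])
dist = np.array([[0.1, 0.2, 0.3, 0.9],    # true matches at ranks 1 and 3
                 [0.2, 0.5, 0.1, 0.9]])   # true matches at ranks 3 and 4
mean_ap, cmc = evaluate(dist, query_ids, gallery_ids)
```

For the toy data, the first query has average precision 5/6 and the second 5/12, so mAP is 0.625, Rank-1 is 0.5, and Rank-5 is 1.0.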

Implementation Details
In the space-constrained vehicle background segmentation step, a generalizable detection model based on YOLOv8 is trained with a small annotation cost and is employed to provide accurate bounding box prompts for SAM. In SAM-driven MAE pre-training, all samples are trained for 50 epochs, and the batch size is set to 64. For the unsupervised vehicle Re-ID downstream task, each image is augmented by random horizontal flipping, padding, cropping, and erasing. The Re-ID model is trained for 60 epochs, with each epoch consisting of 600 iterations. Each iteration involves learning from a mini-batch of 64 samples, comprising four images for each of 16 pseudo-classes. CLIP-B/16 [32] is used as the network encoder and participates in the steps in Sections 3.2 and 3.3. In both steps, all images are resized to 256×256, and the Re-ID model is updated by the Adam optimizer. Considering device limitations, we choose 60,000 images from the VeRi-Wild dataset and use them as the training set. All experiments are conducted on the Ubuntu 18.04 operating system in a PyTorch environment with four Tesla P40 GPUs.
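The mini-batch composition described above (16 pseudo-classes with four images each) can be sketched with a simple sampler. The function name, the exclusion of DBSCAN outliers (label -1), and sampling with replacement for clusters smaller than four images are assumptions for illustration.

```python
import random
from collections import defaultdict

def sample_pk_batch(pseudo_labels, num_classes=16, per_class=4, rng=None):
    """Return image indices for one mini-batch of num_classes x per_class samples."""
    rng = rng or random.Random()
    by_label = defaultdict(list)
    for idx, label in enumerate(pseudo_labels):
        if label >= 0:                     # assumed: DBSCAN outliers (-1) are skipped
            by_label[label].append(idx)
    if len(by_label) < num_classes:
        raise ValueError("not enough pseudo-classes for one mini-batch")
    batch = []
    for label in rng.sample(sorted(by_label), num_classes):
        pool = by_label[label]
        if len(pool) >= per_class:
            batch.extend(rng.sample(pool, per_class))
        else:
            # Assumed: small clusters are sampled with replacement.
            batch.extend(rng.choices(pool, k=per_class))
    return batch

# Toy pseudo labels: 20 clusters of 10 images each plus 5 outliers.
pseudo_labels = [i % 20 for i in range(200)] + [-1] * 5
batch = sample_pk_batch(pseudo_labels, rng=random.Random(0))
batch_labels = [pseudo_labels[i] for i in batch]
```

Keeping several images per pseudo-class in every batch guarantees that each anchor has positive pairs available for the triplet loss.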

Ablation Study
Impact of different preserved rates. Table 1 presents the effects of different preserved rates on the downstream Re-ID tasks. The preserved rate in this paper is calculated based on the number of preserved patches in the vehicle identity region of the segmentation result. The experimental results indicate that as the preserved rate increases, the performance of the Re-ID model gradually decreases. The purpose of SAM-driven MAE is to provide preserved vehicle patches and allow the encoder-decoder architecture to learn information about the masked vehicle patches. Therefore, the pre-training model can only learn a small amount of vehicle identity information when a large number of vehicle patches are preserved. In this case, the pre-training model becomes susceptible to background interference because the masked patches contain abundant background information. Overall, preserving a large number of vehicle patches during pre-training limits the feature learning effectiveness of downstream Re-ID tasks. Thus, the preserved rate of pre-training is uniformly set to 25% in the subsequent experiments.
Effect of meta-learning strategy. As shown in Table 2, different attributes are used to replace the background region ratio in the proposed method when the meta-set split is performed to verify the effectiveness of the proposed method. The vehicle model and color attributes of each image are predicted by CLIP to ensure fairness in self-supervised learning. The attribute label sets are defined as follows. Color: black, white, silvery, red, yellow, blue, green, golden, khaki, pink. Vehicle model: sedan, bus, van, truck, hatchback, suv, mpv, jeep. Meta-learning methods based on the two attributes have not achieved notable results because of the considerable appearance differences inherent in images of different colors and models. A comparison of the method that uses pixel-level background information with the proposed patch-level background region ratio shows that the proposed method has better performance because the pixel-level background segmentation information obtained directly from SAM is inaccurate, and inaccurate segmentation information misleads the learning of the Re-ID model. According to the analysis above, the proposed method can effectively represent background interference information that is difficult to describe, so it can effectively help the Re-ID model adapt to background variations.
Table 3 Ablation study of the different baselines with the proposed method on VeRi-776 and VeRi-Wild datasets. Among them, "TransReID-I" and "TransReID-D" respectively represent the TransReID baseline pre-trained on ImageNet and DeiT, "TMGF-L" represents the TMGF baseline pre-trained on Luperson, and "MAE-Random" represents the CLIP-based baseline constructed in this paper.
Baseline comparison. Because the proposed method requires patch-level features, baselines based on ViT and CLIP are considered for performance comparison. Among them, TransReID [33] (modified to an unsupervised architecture) and TMGF [34] are based on ViT, while "MAE-Random" is based on CLIP. As shown in Table 3, compared with the ViT-based baselines, the "MAE-Random" baseline is more adaptable to the proposed method. This is because CLIP is a visual-linguistic pre-trained model. With guidance from high-level semantic information, CLIP differs from ViT, which only focuses on visual information. CLIP therefore separates background information from identity information more easily, thus achieving better performance.
Effect of different modules. To investigate the effectiveness of each module of the proposed method, we conduct ablation experiments on the performance of different modules, as shown in Table 4. The explanation for each module is as follows:
(1) "MAE(Random)" indicates the use of MAE pre-training on the basis of random masks, followed by unsupervised downstream training of vehicle Re-ID.
(2) "Ours(w/o Bg-Meta)" indicates the use of SAM-driven MAE pre-training, followed by unsupervised downstream training of vehicle Re-ID.
(3) "Ours(w/o SAM-driven MAE)" indicates that without any MAE pre-training, unsupervised vehicle Re-ID with background-aware meta-learning is directly performed.
(4) "Ours(w/ Patch-Seg)" indicates the direct use of patch-level segmented images as input in downstream training, using the entire method proposed.
(5) "Ours" refers to the use of all modules of the proposed method.
According to the experimental results, "Ours(w/o Bg-Meta)" has better performance on the VeRi-Wild dataset compared with the other partial variants. The Rank-1 values of "Ours(w/o Bg-Meta)" are 3.8%, 3.8%, and 3.5% higher in the three test sets compared with the Rank-1 values of "Ours(w/o SAM-driven MAE)". This result demonstrates that the proposed SAM-driven MAE pre-training allows downstream Re-ID models to learn additional robust identity features. The results of "Ours(w/o SAM-driven MAE)" show relatively balanced performance on the two datasets, indicating that the proposed background-aware meta-learning method adapts well to datasets with different degrees of background interference. For "Ours(w/ Patch-Seg)", directly applying patch-level image segmentation results downstream achieves the most direct separation of background information. However, because the background carries high-dimensional spatial information, directly removing the background from the image cannot achieve effective performance. Compared with the methods that use individual modules, the "Ours" method that employs all modules exhibits the best performance in all the evaluation indicators and test sets. This finding indicates that the "Ours" method combines sensitivity to discriminative identity information with adaptability to varying degrees of background interference.

Analysis of unsupervised domain adaptation training strategies.
The training of the unsupervised domain adaptation (UDA) task for vehicle Re-ID is divided into two stages: supervised pre-training in the source domain and unsupervised fine-tuning in the target domain. The proposed method provides a robust pre-training model to make the Re-ID tasks focus on discriminative vehicle identity information. To explore the effectiveness of the proposed SAM-driven MAE pre-training in UDA tasks, we compare three pre-training strategies, as shown in Table 5. The specific explanation for each strategy is as follows:
(1) "MAE(S)" indicates a training strategy of conducting SAM-driven MAE pre-training in the source domain, followed by supervised learning in the source domain and unsupervised fine-tuning in the target domain.
(2) "MAE(T)" refers to a training strategy that involves supervised learning in the source domain, followed by SAM-driven MAE pre-training in the target domain and unsupervised fine-tuning in the target domain.
(3) "MAE(S+T)" indicates a training strategy of using images from the source and target domains for SAM-driven MAE pre-training, followed by supervised training in the source domain and unsupervised fine-tuning in the target domain.
According to the experimental results, the "MAE(T)" strategy performs much better than the "MAE(S)" strategy does. However, for the "VeRi-Wild→VeRi-776" task, the "MAE(S)" strategy has a higher Rank-5 value and mAP than the "MAE(T)" strategy because the VeRi-Wild dataset has many images and complex vehicle information. The self-supervised pre-training conducted on the VeRi-Wild dataset enables the Re-ID model to learn abundant robust feature representations for the relatively simple VeRi-776 dataset. The comparison of these experimental results indicates that self-supervised SAM-driven MAE pre-training before supervised training hinders the Re-ID model from adapting the information learned in the source domain to the target domain. In contrast, performing SAM-driven MAE pre-training in the target domain after supervised training in the source domain effectively transfers the information learned by the model in the source domain. This finding also indirectly confirms that the proposed SAM-driven MAE pre-training can alleviate the domain gap in UDA tasks.
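The three strategies differ only in where the SAM-driven MAE stage sits in the pipeline and which images it sees. The following minimal sketch makes that ordering explicit. The step names and the function `build_schedule` are hypothetical labels introduced for illustration, and "MAE(T)" is read here as pre-training on target-domain images, which is the reading the results discussion relies on.

```python
from typing import List, Tuple

Stage = Tuple[str, str]  # (training step, data domain)

# Hypothetical step names; only their ordering is taken from the text.
MAE = "sam_driven_mae_pretrain"
SUP = "supervised_train"
FT = "unsupervised_finetune"


def build_schedule(strategy: str, source: str, target: str) -> List[Stage]:
    """Return the ordered training stages for one UDA pre-training strategy."""
    schedules = {
        # MAE first on source images, then the usual two UDA stages.
        "MAE(S)": [(MAE, source), (SUP, source), (FT, target)],
        # Supervised source training first, then MAE on target images.
        "MAE(T)": [(SUP, source), (MAE, target), (FT, target)],
        # MAE on the union of both domains, then the usual two UDA stages.
        "MAE(S+T)": [(MAE, f"{source}+{target}"), (SUP, source), (FT, target)],
    }
    if strategy not in schedules:
        raise ValueError(f"unknown strategy: {strategy}")
    return schedules[strategy]
```

For example, `build_schedule("MAE(T)", "VeRi-Wild", "VeRi-776")` places the MAE stage between supervised training on VeRi-Wild and unsupervised fine-tuning on VeRi-776.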

Comparison with State-of-the-art Methods
Existing state-of-the-art methods are compared with the proposed method in Table 6. The proposed method is superior to the other methods in all evaluation indicators. MetaCam [39] employs a meta-learning strategy to overcome camera variations by using camera annotations. Compared with MetaCam, the proposed method fully considers the interference caused by background variations without using any annotation information, and it outperforms MetaCam by 8.8% on VeRi-776 and 2.8% on VeRi-Wild (Test3000) in terms of mAP. These results demonstrate the effectiveness of the proposed meta-learning method and its low dependency on manual annotation. Compared with the currently best-performing methods, GroupSampling [36] and GCMT [42], the proposed method has better performance on the VeRi-776 and VeRi-Wild datasets, respectively. The key reason is that the two methods focus on

Table 6 Comparison of the proposed method with state-of-the-art methods on VeRi-776 and VeRi-Wild datasets.

The latest methods for some UDA tasks are also compared with the proposed method, as shown in Table 7. The proposed method surpasses the other methods by large margins regardless of whether the target domain is VeRi-776 or VeRi-Wild. Specifically, on the VeRi-776 dataset, the Rank-1 and mAP of the proposed method are 7.7% and 17.9% higher than those of the best-performing method AWB [44], respectively. In the case of VeRi-Wild (Test3000), the Rank-1 and mAP of the proposed method are 1.4% and 2.3% higher than those of MMT [41], respectively. The AE [43] and GLT [45] methods optimize representation learning in the latent space to reduce label noise and domain differences. However, the abstract nature of representation learning can be difficult to control in iterative training, making it challenging for the model to accurately capture discriminative identity information. To address this, our proposed method utilizes SAM to provide efficient and precise background guidance, increasing the model's sensitivity to identity information and improving overall performance. Additionally, comparing the performance of the methods on UDA and USL tasks, we observe a substantial improvement on VeRi-776. This improvement indicates that our method can effectively learn robust identity information and prompts the pre-training model to apply the knowledge learned from large-scale datasets to downstream UDA tasks.

Qualitative Analysis
Visualization of the segmentation result.
As shown in Fig. 5, the pixel-level segmentation results obtained by SAM for different scenes (i.e., (a), (b), (c), and (d)) are inaccurate, leading to incorrect guidance for downstream tasks. As indicated in the fourth column of Fig. 5, the proposed space-constrained vehicle background segmentation method refines the segmentation results of SAM, further distinguishing between vehicle and background information. For example, in Fig. 5(c), our method corrects the result of SAM that mistakenly assigns shrubs and trees to vehicle regions.

Visualization of the rank list.
The retrieval results of the four methods in Table 4 are visualized to reveal the effectiveness of the proposed method intuitively. In Fig. 6, the top 5 rank list results for the corresponding query are given. According to the rank lists, the "MAE (Random)" method, which does not consider background information, cannot distinguish vehicles that have similar structures but different identities. For Query A in Fig. 6(a), because it cannot separate vehicles of different identities that share the same structure and pass by shrubs, the "MAE (Random)" method mistakenly identifies the top three candidate samples as positive samples. The methods that use individual modules alleviate this problem to varying degrees, whereas the "Ours" method that employs all modules retrieves the top five of the rank lists correctly. In terms of Query B in Fig. 6(b), the "Ours" method can accurately identify positive samples despite pedestrian interference in the background. For Query C in Fig. 6(c), the proposed method prioritizes the intricate details of the vehicle region. This enables it to effectively mitigate variations in the background environment resulting from lighting changes and accurately retrieve the top five positive samples.

Visualization of T-SNE.
The feature learning ability of the proposed method is also assessed qualitatively. Twenty classes in the training set of VeRi-776 are randomly selected, and their feature distributions are visualized. As shown in Fig. 7, compared with the "MAE(Random)" method, the "Ours(w/o Bg-Meta)" method makes the Re-ID model more sensitive to vehicle appearance information, thereby effectively widening the distance among various classes. However, due to the effects of background variations, the "Ours(w/o Bg-Meta)" method still maintains a large intra-class distance. Compared with "Ours(w/o Bg-Meta)", "Ours(w/o SAM-driven MAE)" makes the Re-ID model adapt to varying degrees of background interference, thus remarkably reducing the distance within each class. When the "Ours" method that combines the advantages of the two modules is used, a reliable feature distribution is obtained. For example, the points inside each red circle in the "Ours" method lie close together, whereas the distances between circles are large.

Visualization of reconstructed effects.
We compare the reconstruction effects of "MAE (Random)" and "SAM-driven MAE", as illustrated in Fig. 8. The "SAM-driven MAE" method is more accurate than the "MAE (Random)" method in reconstructing fine-grained information about vehicles. As shown in Fig. 8(a) and (b), the "MAE (Random)" method produces blurrier results than the proposed method in the reconstruction of vehicle profiles. As indicated in Fig. 8(c), the "SAM-driven MAE" method still reconstructs the vehicle contour for the patches obstructed by trees in the original image, whereas the "MAE (Random)" method fails to do so. These observations indirectly confirm that our method can effectively provide a robust pre-training model that can distinguish between background and discriminative identity information.
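The difference between the two masking schemes compared above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a binary vehicle mask (1 for vehicle pixels) is already available from the segmentation stage, treats a patch as a vehicle patch when at least `min_cover` of its pixels are vehicle pixels, and applies the preserved rate to the vehicle patches only. The threshold, the function names, and the way the rate is applied are all assumptions.

```python
import random
from typing import List, Sequence, Set


def vehicle_patches(mask: Sequence[Sequence[int]], patch: int,
                    min_cover: float = 0.5) -> List[int]:
    """Row-major indices of patches whose vehicle-pixel fraction >= min_cover."""
    rows, cols = len(mask) // patch, len(mask[0]) // patch
    selected = []
    for r in range(rows):
        for c in range(cols):
            covered = sum(
                mask[y][x]
                for y in range(r * patch, (r + 1) * patch)
                for x in range(c * patch, (c + 1) * patch)
            )
            if covered / (patch * patch) >= min_cover:
                selected.append(r * cols + c)
    return selected


def random_keep(num_patches: int, keep_rate: float,
                rng: random.Random) -> Set[int]:
    """MAE (Random): preserve patches drawn uniformly from the whole grid."""
    k = max(1, round(num_patches * keep_rate))
    return set(rng.sample(range(num_patches), k))


def sam_driven_keep(mask: Sequence[Sequence[int]], patch: int, keep_rate: float,
                    rng: random.Random, min_cover: float = 0.5) -> Set[int]:
    """SAM-driven: preserve a random subset of vehicle patches; mask the rest."""
    candidates = vehicle_patches(mask, patch, min_cover)
    if not candidates:
        return set()  # no vehicle patch found, so nothing is preserved
    k = max(1, round(len(candidates) * keep_rate))
    return set(rng.sample(candidates, k))
```

With random masking, preserved patches may fall on shrubs or road surface; with the mask-guided variant, every preserved patch lies on the vehicle, so the reconstruction target is driven by vehicle content.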

Discussion on Method Complexity
The proposed method employs SAM to obtain low-cost background segmentation information, which guides the model to perform two-stage background information separation learning: SAM-driven MAE pre-training and background-aware meta-learning. As shown in Table 8, compared with current methods, the proposed method requires extra end-to-end pre-training and has higher complexity. In return, it gains a degree of background-aware ability and achieves more competitive performance.

Discussion on Person Re-ID
The performance of unsupervised person Re-ID methods is similarly constrained by image background factors. To verify the universality and generalization ability of the proposed method, we compared it with the latest approaches in the field of unsupervised person Re-ID. The experiments were conducted on the Market-1501 dataset, and the specific results are shown in Table 9. Compared with the state-of-the-art method MetaCam, the proposed method achieved improvements of 3.8% and 19.5% in Rank-1 and mAP, respectively. Despite the more complex interference of background elements in person Re-ID datasets, the proposed method still demonstrates competitive performance. This directly attests to the effectiveness of the proposed method in the task of unsupervised person Re-ID.
To better understand the resistance of the proposed method to background interference in unsupervised person Re-ID tasks, we visualized the focal regions of model features and compared them with existing methods. As shown in Fig. 9, compared with other methods, our approach makes the model more sensitive to the human body region and pays less attention to background areas lacking identity information. This is because our method utilizes pre-training and meta-learning to keep identity-independent information from interfering with model representation learning. It effectively guides the model to focus on the unique areas of person images, resulting in the learning of more robust features.

Discussion on Supervised Re-ID
The effectiveness of the proposed method has been further validated on the supervised vehicle Re-ID task. Specifically, we replaced the pseudo-labels generated through clustering in our proposed method with the real labels of the training set. Subsequently, we compared the performance of this method with existing approaches, and the experimental results are shown in Table 10. In comparison with the top-performing methods, UMTS [53] and CAL [55], our proposed method achieved remarkable improvements of 11.9% and 13.5% in mAP, respectively. This indicates the insensitivity of our proposed method to task variations and its robust generalization capability. Furthermore, it underscores that in a supervised learning context without label noise interference, our proposed method can more effectively capture distinctive identity information.

Conclusions
We propose SAM-driven MAE pre-training and background-aware meta-learning for unsupervised vehicle Re-ID. A space-constrained vehicle background segmentation method is presented to obtain high-quality background segmentation results via SAM. To enhance the capacity to distinguish between background information and vehicle identity, we design SAM-driven MAE pre-training to learn identity-sensitive features for downstream unsupervised vehicle Re-ID tasks. For these downstream tasks, background-aware meta-learning is proposed to enhance the adaptability of the Re-ID model to varying degrees of background interference by using the background region ratios. Extensive experiments confirm that the proposed method can effectively alleviate the problem of background variations. In our future work, SAM-driven large-scale pre-training that adopts text prompt learning will be further explored to reduce the complexity of the extra end-to-end pre-training.

Fig. 1 Overview of the proposed method.

Fig. 2 Detailed process of space-constrained vehicle background segmentation.

Fig. 4 Visualization of the detailed information of the meta-train and meta-test sets' splitting strategy. (a) Proportion of images in the ratio range of each background region in the VeRi-776 and VeRi-Wild training sets. (b) Proposed meta-set splitting strategy.

Fig. 5 Visual comparison of segmentation results obtained by directly applying SAM and the proposed method with various vehicle models in four different scenes: (a) complex vehicle structure, (b) blurred scene, (c) static occluded scene with shrub, and (d) dynamic occlusion scene with pedestrian.

Fig. 6 Top 5 rank lists retrieved for queries with various vehicle models by different ablation modules. The green and red boxes indicate positive and negative candidate samples, respectively.

Fig. 7 T-SNE visualization of the feature distribution of different ablation modules. Each point of the same color belongs to the same class.

Fig. 8 Visualization of the reconstructed images by "MAE (Random)" and "SAM-driven MAE" methods. The gray patches are preserved patches used for image reconstruction. Both methods preserve patches in the same position during image reconstruction to ensure a fair comparison.

The overall training process of the proposed SAM-driven MAE pre-training and background-aware meta-learning for unsupervised vehicle Re-ID method is summarized in Algorithm 1.

Algorithm 1 Procedure of proposed method.
Input: Unlabeled training set U, bounding box prompt T, batch size b, segment anything model SAM, encoder E_MAE, decoder D_MAE.
for image_iter in pre-train_iters do
    Sample mini-batch with b in U to obtain u;
    Randomly preserve some vehicle patches in u through Mask^p_SAM and input E_MAE to obtain patch-feature encoding f_MAE;
    Fill f_MAE with mask tokens and input D_MAE to obtain the reconstructed image;
    Compute L_MSE with Eq. 3;
    Update parameters for E_MAE and D_MAE based on L_MSE;
end for
//Background-aware meta-learning for Re-ID;
Generate pseudo-labels for U with DBSCAN;
Split U into U_TR and U_TE;
for image_iter in train_iters do
    Sample mini-batch with b from U_TR and U_TE to obtain u_TR and u_TE, respectively;
    Build E_TR using pre-trained E_MAE parameters and performing meta-train flow;
    Build E_TE and performing meta-test flow;
    Optimize θ_TR with gradient computed by Eq. 7;
end for
θ* ← θ_TR;
Result: θ*

Experiments

Datasets and Evaluation Protocols

Datasets.
Extensive experiments are conducted on two widely used datasets: VeRi-776 and VeRi-Wild. The contents of the VeRi-776 dataset were collected from 20 cameras covering a real traffic monitoring area of 1 km² within 24 h. It has a total of 50,117 images of 776 vehicles, including 37,778 images in the training set, 1,678 images in the query, and 10,661 images in the gallery. VeRi-Wild is a large-scale vehicle Re-ID dataset. It contains 416,314 images of 40,671 vehicles that were obtained from 174 cameras that recorded images within a month. The images were captured under the influence of various environmental factors, such as backgrounds, lighting, viewpoints, and weather. The training set includes 277,794 images of 30,671 vehicles. VeRi-Wild divides the test set into three subsets. The small subset includes 41,816 images of 3,000 vehicles, the medium subset includes 69,389 images of 5,000 vehicles, and the large subset includes 138,517 images of 10,000 vehicles.
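The results in this paper are reported as Rank-k accuracy and mAP. As a reference for how these indicators are commonly computed from a query-gallery distance matrix, a minimal sketch is given below. It is not the evaluation code used in the experiments: the function name `evaluate` is hypothetical, and the sketch omits the filtering of same-camera gallery images that standard Re-ID protocols usually apply.

```python
from typing import Dict, List, Sequence, Tuple


def evaluate(dist: Sequence[Sequence[float]],
             query_ids: Sequence[int],
             gallery_ids: Sequence[int],
             ks: Tuple[int, ...] = (1, 5)) -> Tuple[Dict[int, float], float]:
    """Return ({k: Rank-k accuracy}, mAP) for a distance matrix (smaller = closer)."""
    hits = {k: 0 for k in ks}
    aps: List[float] = []
    for q, row in enumerate(dist):
        # Gallery indices sorted from most to least similar.
        order = sorted(range(len(row)), key=row.__getitem__)
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        if not any(matches):
            continue  # queries without any positive in the gallery are skipped
        first_hit = matches.index(True)
        for k in ks:
            hits[k] += first_hit < k
        # Average precision: mean of the precision at each positive's rank.
        found, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        aps.append(sum(precisions) / len(precisions))
    if not aps:
        raise ValueError("no query has a positive sample in the gallery")
    cmc = {k: hits[k] / len(aps) for k in ks}
    return cmc, sum(aps) / len(aps)
```

Rank-k counts a query as correct when at least one positive appears among its k nearest gallery images, whereas mAP also rewards ranking all positives early, which is why the two indicators can move by different amounts in the comparisons.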

Table 1
Performance comparison of SAM-driven MAE pre-training with different preserved rates on VeRi-776.

Table 2
Performance comparison of meta-learning strategies based on different attributes on VeRi-776. "Bg ratio" denotes the ratio of the background region.

Table 4
Ablation study of the different modules on VeRi-776 and VeRi-Wild datasets.

Table 5
Comparison of different training strategies for UDA tasks.

Table 7
Comparison of the proposed method with state-of-the-art UDA vehicle Re-ID methods on source dataset → target dataset tasks.

Table 8
Comparison of complexity and performance on the VeRi-776 dataset. "B-A" denotes background-aware ability.

Table 9
Comparison with state-of-the-art unsupervised person Re-ID methods on the Market-1501 dataset.
Fig. 9 Visualization of attention maps for features by different methods.

Table 10
Comparison with state-of-the-art supervised vehicle Re-ID methods.
Yuhan Geng received the B.S. degree in Bioinformatics from The Chinese University of Hong Kong, Shenzhen, China in 2023. She is currently pursuing the M.S. degree at University of Michigan, Ann Arbor, United States. Her current research interests include computer vision.