MambaReID: Exploiting Vision Mamba for Multi-Modal Object Re-Identification

Multi-modal object re-identification (ReID) is a challenging task that seeks to identify objects across different image modalities by leveraging their complementary information. Traditional CNN-based methods are constrained by limited receptive fields, whereas Transformer-based approaches are hindered by high computational demands and a lack of convolutional biases. To overcome these limitations, we propose a novel fusion framework named MambaReID, integrating the strengths of both architectures with the effective VMamba. Specifically, our MambaReID consists of three components: Three-Stage VMamba (TSV), Dense Mamba (DM), and Consistent VMamba Fusion (CVF). TSV efficiently captures global context information and local details with low computational complexity. DM enhances feature discriminability by fully integrating inter-modality information with shallow and deep features through dense connections. Additionally, with well-aligned multi-modal images, CVF provides more granular modal aggregation, thereby improving feature robustness. The MambaReID framework, with its innovative components, not only achieves superior performance in multi-modal object ReID tasks, but also does so with fewer parameters and lower computational costs. Our proposed MambaReID’s effectiveness is validated by extensive experiments conducted on three multi-modal object ReID benchmarks.


Introduction
Object Re-identification (ReID) aims to re-identify the same object across different camera views. Due to its wide range of applications, object ReID [1][2][3] has advanced significantly in recent years. In particular, traditional object ReID mainly focuses on extracting discriminative information from easily accessible RGB images. However, various complex imaging conditions, such as darkness, strong lighting, and low image resolution, may severely affect the quality of RGB images [4]. The critical regions of objects become blurred [5], resulting in a loss of discriminative information. Fortunately, multi-modal object ReID [6][7][8] has shown significant potential in overcoming these challenges. By integrating complementary information from near-infrared (NIR), thermal infrared (TIR), and RGB images, multi-modal object ReID can provide more robust representations in complex scenes. Thus, it has attracted increasing attention in the past few years.
To aggregate heterogeneous information from different modalities, Li et al. [5] first propose a multi-modal vehicle ReID benchmark named RGBNT100, which contains RGB, NIR, and TIR images. Meanwhile, they propose HAMNet to learn robust representations with a heterogeneous coherence loss. Further, Zheng et al. [4] propose PFNet to progressively fuse multi-modal features, along with the first multi-modal person ReID benchmark, named RGBNT201. Wang et al. [9] employ three learning methods to enhance modality-specific knowledge with the IEEE framework. Then, Zheng et al. [6] introduce DENet to tackle the modality-missing issue. With generative models, Guo et al. [10] present GAFNet, which fuses heterogeneous information. Meanwhile, with the generalization ability of Transformers [11], researchers begin to explore the potential of Transformers in multi-modal object ReID. Pan et al. [12] construct PHT, utilizing a feature hybrid mechanism to balance information from different modalities. Through analyzing modality laziness, Jennifer et al. [13] provide a strong baseline, UniCat. Further, Wang et al. [7] introduce a new token permutation mechanism designed to enhance the robustness of multi-modal object ReID. From the perspective of test-time training, Wang et al. [14] propose HTT to exploit the information existing in the test data. Recently, Zhang et al. [8] construct EDITOR to select diverse features and minimize the influence of background noise. Although these methods achieve promising results, they still have some limitations.
For CNN-based methods, performance is hindered by limited receptive fields, making it difficult for them to capture global information from heterogeneous modalities. As for Transformer-based methods, while they exhibit superior performance, the quadratic computational complexity [11] introduced by attention mechanisms is unacceptable. Additionally, the lack of convolutional inductive bias [15] in Transformer-based methods results in weaker perception of local details, leading to the neglect of certain information among different modalities. Thus, efficient integration of global and local information becomes crucial for multi-modal object ReID. However, existing methods may fail to exploit the complementary advantages of the above frameworks. On the one hand, the features extracted by existing methods lack sufficient robustness. On the other hand, the most effective current approaches, which employ highly complex Transformer models, are not efficient. Therefore, there is an urgent need for an efficient method capable of extracting robust features.
Recently, Mamba [16] has drawn significant attention [17] due to its superior scalability with State Space Models (SSMs) [18]. Empowered by the Selection Mechanism (S6), Mamba surpasses alternative state-of-the-art architectures, such as CNNs and Transformers, in managing long sequences with linear complexity. Furthermore, Mamba has been successfully applied to various computer vision tasks [19][20][21], such as image classification, object detection, and video understanding. Specifically, VMamba [19] integrates S6 with a four-direction scanning mechanism, which can fully capture global context information and local details. Drawing inspiration from VMamba's exceptional performance in image classification, we introduce a novel fusion framework named MambaReID, specifically designed for multi-modal object ReID. This approach leverages the strengths of various modalities to enhance the re-identification process, aiming to significantly improve accuracy and efficiency in multi-modal object ReID scenarios.
Technically, our proposed MambaReID consists of three main components: Three-Stage VMamba (TSV), Dense Mamba (DM), and Consistent VMamba Fusion (CVF). More specifically, TSV is designed to extract robust multi-modal representations with a four-direction scanning mechanism. Instead of directly transferring VMamba, we observe that its final-stage downsampling leads to substantial detail loss and computational redundancy. For tasks like multi-modal ReID, high-resolution feature maps provide more details for subsequent modality fusion. Hence, we opt to skip the final stage and adopt a last-stride technique, as described in BoT [22]. This adaptation allows TSV to preserve richer details while reducing computational overhead, yielding more robust results than Transformers. Then, we introduce a DM into the TSV to further enhance the discriminative ability. Dense connections are crucial for fine-grained image classification tasks [23], as they offer semantic information at different levels, thereby enhancing the robustness of features. With the dense connections, DM can fully integrate the inter-modality information with shallow and deep features. Different from previous dense connections, we introduce them only in the last stage of TSV, with a small computational overhead. Hence, MambaReID can retain more fine-grained details at less computational cost. Finally, to effectively integrate information from multiple modalities, we introduce the CVF. CVF incorporates a consistency loss function to align deep features across different modalities. This loss ensures that features from different modalities are well-aligned, facilitating effective fusion. The aligned features are then concatenated along the channel dimension and processed by the VMamba block for modality integration. This step enables simultaneous integration of information from multiple modalities at identical spatial positions, thereby enhancing the granularity of modal aggregation. In this way, CVF ensures the effective utilization of well-aligned multi-modal images. With the proposed components, MambaReID provides more robust multi-modal features for multi-modal ReID. Extensive experiments on three public benchmarks demonstrate that our MambaReID achieves superior performance compared to most state-of-the-art methods.
In summary, our contributions are as follows:

- We propose MambaReID, a novel VMamba-based fusion framework for multi-modal object ReID, whose Three-Stage VMamba (TSV) backbone captures global context and local details with low computational complexity.
- We introduce Dense Mamba (DM), which enhances feature discriminability by integrating inter-modality information with shallow and deep features through dense connections, and Consistent VMamba Fusion (CVF), which provides more granular modal aggregation for well-aligned multi-modal images.
- Extensive experiments on three multi-modal object ReID benchmarks validate the effectiveness and efficiency of MambaReID.

Proposed Methodology

Overall Architecture of MambaReID
As shown in Figure 1, our MambaReID consists of three main components: Three-Stage VMamba (TSV), Dense Mamba (DM), and Consistent VMamba Fusion (CVF). We employ the TSV as the backbone. It is designed to extract robust single-modal representations from the RGB, NIR, and TIR modalities. DM is introduced at the final stage of TSV to further improve the network's discriminative capability. Additionally, with the well-aligned multi-modal images, CVF is employed to enhance the modal aggregation granularity. Detailed descriptions of our proposed modules are provided in the following sections.

Preliminary
State Space Models (SSMs). SSMs [24,25] have been widely employed across diverse sequential data modeling tasks. Inspired by continuous systems, SSMs adeptly capture the dynamic patterns inherent in input sequences. To be specific, an SSM maps the input x(t) ∈ R^L to the output y(t) ∈ R^L through a hidden state h(t):

h'(t) = Ah(t) + Bx(t),
y(t) = Ch(t) + Dx(t),

where A ∈ C^(N×N), B ∈ C^N, and C ∈ C^N are the model parameters, D ∈ C is the residual term, and N is the dimension of the hidden state h(t). The above equations can easily model continuous inputs. However, for a discrete input like an image or text, the SSM needs to be discretized with the zero-order hold (ZOH) method. With Δ ∈ R^D representing the predefined timescale parameter that maps the continuous parameters A and B into a discrete space, the discretization process is described as:

Ā = exp(ΔA),
B̄ = (ΔA)^(−1)(exp(ΔA) − I) · ΔB,

where Ā and B̄ are the discretized matrices. After discretization, the SSM is calculated as follows:

h_t = Āh_(t−1) + B̄x_t,
y_t = Ch_t.

Finally, with the discretized SSM, we can model discrete inputs with linear complexity.

Selective Scan Mechanism. SSMs can be utilized for efficient sequence modeling. However, the conventional SSM may fail to capture the complex patterns in various input sequences. Without a data-dependent structure, the SSM lacks the ability to focus on or ignore specific information. To solve this issue, Gu et al. [16] present a Selective Scan mechanism for SSMs (S6), where the matrices B ∈ R^(L×N), C ∈ R^(L×N), and Δ ∈ R^(L×D) are generated from the input data x ∈ R^(L×D). This enables S6 to fully perceive the contextual information of the input, rendering it more flexible and efficient.
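To make the ZOH discretization and the linear-time recurrence above concrete, here is a minimal NumPy sketch assuming a diagonal A (as is common in Mamba-style SSMs); the function names are ours and this is not an official implementation:

```python
import numpy as np

def zoh_discretize(A, B, delta):
    """Zero-order-hold discretization of a diagonal SSM.

    A: (N,) diagonal of the continuous state matrix, B: (N,), delta: scalar.
    Implements A_bar = exp(delta*A) and
    B_bar = (delta*A)^{-1} (exp(delta*A) - I) * delta*B, which for a
    diagonal A reduces elementwise to (exp(delta*A) - 1) * B / A.
    """
    dA = np.exp(delta * A)
    dB = (dA - 1.0) / A * B
    return dA, dB

def ssm_scan(x, A, B, C, delta):
    """Linear-time recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    dA, dB = zoh_discretize(A, B, delta)
    h = np.zeros_like(A)
    ys = []
    for xt in x:                      # one sequential step per input token
        h = dA * h + dB * xt
        ys.append(np.dot(C, h))
    return np.array(ys)
```

For a stable A (negative real parts) and a constant input, the output approaches a steady state, mirroring the low-pass behavior of the continuous system.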
2D Selective Scan. As shown in the first image of Figure 2, S6 usually scans the input sequence in one direction. However, in visual tasks, the sequences we encounter often consist of non-causal image data, so the unidirectional scanning employed in S6 is impractical. With single-direction scanning, the current image patch can only perceive information from the previous patches, rather than local information from different directions. To address this issue, VMamba [19] introduces a 2D Selective Scan (SS2D) mechanism. To be specific, SS2D scans the input sequence in four directions: left-to-right, right-to-left, top-to-bottom, and bottom-to-top. As shown in Figure 2, a patch in a specific region can thus perceive information from adjacent patches in different directions, which enhances feature discriminability with contextual information. Thus, in our MambaReID, we employ SS2D to extract robust multi-modal representations. Specifically, we only need to unfold the images into sequences along the different directions.
After scanning, we restore them according to their original relative positions, and then sum the scanning results from the four directions at the same position.
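As a concrete illustration of this unfold–scan–restore–sum procedure, the following NumPy sketch applies a stand-in 1D scan operator in the four directions and sums the results at each original position (the real SS2D uses the selective-scan S6 operator in place of the toy scan passed in here):

```python
import numpy as np

def ss2d_merge(feat, scan_fn):
    """Four-direction scan-and-merge over an (H, W) feature map.

    scan_fn maps a 1D sequence to a 1D sequence of the same length and
    stands in for the S6 selective-scan operator.
    """
    H, W = feat.shape
    out = np.zeros_like(feat)
    # Directions 1/2: row-major left-to-right and its reverse.
    seq = feat.reshape(-1)
    out += scan_fn(seq).reshape(H, W)
    out += scan_fn(seq[::-1])[::-1].reshape(H, W)     # restore original order
    # Directions 3/4: column-major top-to-bottom and its reverse.
    seq_t = feat.T.reshape(-1)
    out += scan_fn(seq_t).reshape(W, H).T
    out += scan_fn(seq_t[::-1])[::-1].reshape(W, H).T
    return out
```

With the identity as `scan_fn`, every position simply receives four copies of itself, which is a quick sanity check that the restore step preserves spatial positions; with a causal operator such as `np.cumsum`, each patch aggregates information from all four directions.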

Three-Stage VMamba
To fully exploit discriminative information at low computational complexity, we first introduce VMamba as our backbone. Generally, VMamba is composed of four stages with three downsampling operations. However, the downsampling in the last stage of VMamba leads to a significant loss of detailed information. Moreover, the final stage introduces redundant computational costs. Therefore, we eliminate the last downsampling, as in BoT [22], to preserve richer details. Further, we directly remove the final stage to achieve more efficient modeling. Thus, we propose the Three-Stage VMamba (TSV) as our backbone, as shown in Figure 1.
To be specific, we denote the input multi-modal images as X = {X_RGB, X_NIR, X_TIR ∈ R^(H×W×C)}, where H, W, and C denote the height, width, and channel of the images, respectively. For illustrative purposes and without loss of generality, we consider the RGB modality as a representative example. The RGB images X_RGB are first fed into the Stem block with a convolutional layer to extract the initial features F_RGB. Then, F_RGB is fed into the VSS block to integrate the global and local information. Technically, the VSS block consists of LayerNorm (LN) [26], linear transformations, Depthwise Convolution (DWConv) [27], SS2D, and the SiLU [28] activation function. As shown in the right corner of Figure 1, the VSS block can be formulated as:

Ω(X) = LN(SS2D(DWConv(Ψ_1(X)))),
VSS(X) = Linear(SiLU(Ω(X)) × SiLU(Ψ_2(X))),

where Ψ_1 and Ψ_2 denote linear transformations.
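A minimal PyTorch sketch of this formulation is given below; the SS2D operator is replaced by an identity placeholder, and `psi1`/`psi2` name the two linear projections from the equations. This is our simplified reading of the block, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class VSSBlock(nn.Module):
    """Sketch of a VSS block: VSS(X) = Linear(SiLU(Omega(X)) * SiLU(Psi2(X)))
    with Omega(X) = LN(SS2D(DWConv(Psi1(X)))). SS2D defaults to an identity
    placeholder here."""

    def __init__(self, dim, ss2d=None):
        super().__init__()
        self.psi1 = nn.Linear(dim, dim)
        self.psi2 = nn.Linear(dim, dim)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise
        self.norm = nn.LayerNorm(dim)
        self.out = nn.Linear(dim, dim)
        self.ss2d = ss2d if ss2d is not None else nn.Identity()
        self.act = nn.SiLU()

    def forward(self, x):                          # x: (B, H, W, C)
        u = self.psi1(x).permute(0, 3, 1, 2)       # (B, C, H, W) for conv
        u = self.dwconv(u).permute(0, 2, 3, 1)     # back to (B, H, W, C)
        omega = self.norm(self.ss2d(u))
        gate = self.act(self.psi2(x))              # multiplicative gating branch
        return self.out(self.act(omega) * gate)
```

The gating branch (Ψ_2 plus SiLU) modulates the scanned features channel-wise, which is the standard gated-SSM design.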
Then, with the residual connection, the output of the current VSS block is fed into the next VSS block as follows:

X^l = VSS(X^(l−1)) + X^(l−1),

where X^l denotes the output of the l-th VSS block. Thus, after the first stage of TSV, we obtain the features F^1_RGB. Then, the features F^1_RGB are fed into the next stage for further processing. Similarly, we obtain the second-stage features F^2_RGB. Finally, with the last stage of TSV, we obtain the features F^3_RGB. Similar to the RGB modality, we can also extract the features F^3_NIR and F^3_TIR. Then, we apply the pooling operation P to F^3_RGB, F^3_NIR, and F^3_TIR, respectively. Finally, we concatenate the pooled features to obtain the multi-modal features f_tsv as follows:

f_tsv = [P(F^3_RGB), P(F^3_NIR), P(F^3_TIR)],

where [·] is the concatenation operation. After applying the loss supervision on f_tsv, we can extract robust multi-modal representations with TSV.
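The pooling-and-concatenation step for f_tsv can be sketched as follows, assuming P is global average pooling (the specific pooling operator is our assumption):

```python
import numpy as np

def tsv_fuse(f_rgb, f_nir, f_tir):
    """Pool each modality's last-stage feature map (H, W, C) and
    concatenate along channels to form f_tsv = [P(F3_RGB), P(F3_NIR), P(F3_TIR)].
    P is assumed to be global average pooling over the spatial dimensions."""
    pool = lambda f: f.mean(axis=(0, 1))
    return np.concatenate([pool(f_rgb), pool(f_nir), pool(f_tir)], axis=-1)
```

The resulting vector has three times the per-modality channel width, one segment per modality, and is the feature that receives the TSV loss supervision.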

Dense Mamba
Dense connections [29] have been extensively utilized in various computer vision tasks. They can effectively enhance feature discriminability by fully integrating information from shallow and deep features. However, directly introducing dense connections into the TSV would lead to a significant increase in complexity. Additionally, the effectiveness of dense connections in VMamba remains unclear. Therefore, we explore the potential of dense connections in VMamba and introduce the Dense Mamba (DM) in the last stage of TSV. As shown in the right corner of Figure 1, we simplify the dense connections by directly adding the previous features to the current features. To be specific, the DM can be formulated as:

X^l = VSS(X^(l−1)) + Σ_(i=0)^(l−1) X^i,

where l denotes the l-th VSS block in the last stage of TSV. With these simple dense connections, we can fully integrate the information from different levels within a single modality, while the computational overhead remains extremely low.
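A framework-agnostic sketch of these simplified dense connections is shown below, with VSS blocks abstracted as callables; the `freq` argument anticipates the connection-frequency setting studied in the ablations, and the exact summation scheme is our reading of the paper:

```python
import numpy as np

def dense_mamba_stage(x0, blocks, freq=1):
    """Simplified dense connections for the last TSV stage: after every
    `freq` blocks, add the sum of all earlier features to the current
    output (a sketch of 'directly adding previous features to current
    features'). `blocks` is a list of callables standing in for VSS blocks."""
    feats = [x0]
    x = x0
    for l, blk in enumerate(blocks, start=1):
        x = blk(x)
        if l % freq == 0:
            x = x + sum(feats)   # dense shortcut over all earlier levels
        feats.append(x)
    return x
```

With identity blocks the accumulation can be traced by hand, which makes the effect of `freq` easy to verify: a larger `freq` inserts the dense shortcut less often, trading detail aggregation against overhead.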

Consistent VMamba Fusion
To fully exploit the complementary information from different modalities, we introduce the Consistent VMamba Fusion (CVF). With well-aligned multi-modal images, we employ a consistency loss to align the deep features across modalities at the same spatial positions. Furthermore, we concatenate the aligned features along the channel dimension and feed them into the VSS blocks to achieve modality fusion. The detailed process of CVF is illustrated in the right corner of Figure 1. We first directly copy the features F^3_RGB, F^3_NIR, and F^3_TIR to obtain F_R, F_N, and F_T, respectively. Then, we utilize a linear transformation to reduce the channel dimension of F_R, F_N, and F_T to C_3/3:

F̃_m = Linear(F_m), m ∈ {R, N, T}.

Then, we employ the consistency loss to align the features F̃_R, F̃_N, and F̃_T. The consistency constraint loss L_C can be represented as:

L_C = (1/N_p) Σ_(i=1)^(N_p) (‖F̃^i_R − F̃^i_N‖_2 + ‖F̃^i_R − F̃^i_T‖_2 + ‖F̃^i_N − F̃^i_T‖_2),

where N_p is the number of patches. After obtaining the aligned features, we concatenate them along the channel dimension to obtain the multi-modal features F'_0 ∈ R^(H/16 × W/16 × C_3):

F'_0 = [F̃_R, F̃_N, F̃_T].

Then, we feed F'_0 into the stacked K layers of VSS blocks:

F'_l = VSS(F'_(l−1)), l = 1, …, K,

where F'_l denotes the output of the l-th VSS block in the CVF.
Finally, we pool the output of the last VSS block to obtain the features f_cvf ∈ R^(C_3). Subsequently, the features f_cvf are sent for loss supervision. With the fully integrated multi-modal features, we can obtain more discriminative representations.
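The CVF pipeline — channel reduction, consistency loss, channel-wise concatenation, fusion blocks, and pooling — can be sketched in PyTorch as follows; plain linear layers stand in for the VSS blocks, and the distance used in L_C is an assumption (MSE here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def consistency_loss(fr, fn, ft):
    """Pairwise consistency loss over aligned per-patch features (B, Np, C').
    The exact norm is an assumption; mean squared error is used here."""
    return F.mse_loss(fr, fn) + F.mse_loss(fr, ft) + F.mse_loss(fn, ft)

class CVF(nn.Module):
    """Sketch of Consistent VMamba Fusion: reduce each modality's channels
    to c_out // 3, concatenate along channels, fuse with K stacked blocks
    (placeholders for VSS blocks), and pool to f_cvf."""

    def __init__(self, c_in, c_out, K=1):
        super().__init__()
        self.reduce = nn.Linear(c_in, c_out // 3)
        self.blocks = nn.ModuleList([nn.Linear(c_out, c_out) for _ in range(K)])

    def forward(self, fr, fn, ft):                # each (B, Np, c_in)
        r, n, t = self.reduce(fr), self.reduce(fn), self.reduce(ft)
        lc = consistency_loss(r, n, t)            # align at same positions
        x = torch.cat([r, n, t], dim=-1)          # (B, Np, c_out)
        for blk in self.blocks:
            x = blk(x)
        return x.mean(dim=1), lc                  # pooled f_cvf and L_C
```

Because the three modality features are concatenated per patch, each fusion block sees all modalities at the same spatial position simultaneously, which is the fine-grained aggregation the section describes.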

Objective Function
As depicted in Figure 1, our objective function comprises three parts: losses for the TSV backbone, losses for the CVF, and the consistency constraint loss L_C. Both the backbone and the CVF are supervised by the label-smoothing cross-entropy ID loss L_ID [30] and the triplet loss L_Triplet [31]:

L_ReID(f) = L_ID(f) + L_Triplet(f).

The overall objective function is defined as:

L = L_ReID(f_tsv) + L_ReID(f_cvf) + λL_C,

where λ is the hyperparameter used to balance L_C. By minimizing L, our MambaReID can generate more discriminative multi-modal features with low computational complexity.
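The overall objective can be assembled as below; the label-smoothing value, triplet margin, and the simplistic positive/negative selection are placeholders rather than the paper's settings:

```python
import torch
import torch.nn as nn

def mambareid_objective(logits_tsv, logits_cvf, f_tsv, f_cvf, labels, lc, lam=1.0):
    """Sketch: label-smoothing ID loss and a simplified triplet loss on both
    f_tsv and f_cvf, plus lambda * L_C. A real implementation would mine
    batch-hard positives/negatives by identity instead of rolling the batch."""
    id_loss = nn.CrossEntropyLoss(label_smoothing=0.1)
    tri = nn.TripletMarginLoss(margin=0.3)

    def triplet(feats):
        # Placeholder pairing: roll the batch to form anchor/positive/negative.
        return tri(feats, feats.roll(1, 0), feats.roll(2, 0))

    return (id_loss(logits_tsv, labels) + triplet(f_tsv)
            + id_loss(logits_cvf, labels) + triplet(f_cvf)
            + lam * lc)
```

Both branches share the same label supervision; only λ rebalances the consistency term against the two ReID losses.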

Dataset and Experimental Setup
We utilize three multi-modal object ReID benchmarks to evaluate the performance of our proposed MambaReID. Particularly, RGBNT201 [4] is the inaugural multi-modal dataset for person re-identification, which includes RGB, NIR, and TIR modalities. RGBNT100 [5] is a large-scale multi-modal vehicle ReID dataset, while MSVR310 [32] is a smaller-scale multi-modal vehicle ReID dataset featuring complex visual scenes. Regarding the evaluation metrics, we align with prior studies and adopt mean Average Precision (mAP) and Cumulative Matching Characteristics (CMC) at Rank-K (K = 1, 5, 10). Additionally, we report the trainable parameters and FLOPs for complexity analysis.

Implementation Details
We leverage a pre-trained VMamba [19], sourced from the ImageNet classification dataset [33], as the backbone of our architecture. The images of RGBNT201 are resized to 256 × 128, and the images of the RGBNT100 and MSVR310 datasets are resized to 128 × 256 during data processing. In the training process, we enhance the robustness of our model by applying data augmentation techniques such as random horizontal flipping, cropping, and erasing [34]. The training process uses a mini-batch size of 64, consisting of eight randomly selected identities, each providing eight images. Our model is optimized with Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001. The learning rate is initially set to 0.01, and a cosine decay warm-up strategy is applied. The hyperparameter λ in L is set to 1. For the CVF, we set K to 1. During testing, we concatenate the features f_tsv and f_cvf to obtain the final multi-modal features for retrieval. Specifically, each query consists of three paired images: RGB, NIR, and TIR. The model takes these three modalities as input and extracts the final retrieval vector according to the network structure, which serves as the feature of the query. A similar process is applied to each triplet in the gallery. Finally, by ranking the similarity between the features of the query and those in the gallery, the model determines whether the object is correctly matched. The proposed method is implemented using the PyTorch framework, and experiments are conducted on a single NVIDIA A100 GPU (NVIDIA, Santa Clara, CA, USA).

Multi-modal Person ReID. As reported in Table 1a, we conduct comprehensive comparisons with dominant single-modal and multi-modal approaches on RGBNT201. Typically, single-modal approaches perform poorly due to the lack of specialized designs for multi-modal fusion. Among single-modal methods, PCB demonstrates a notable mAP of 32.8%. For multi-modal methods, CNN-based frameworks demonstrate inferior performance compared to
Transformer-based ones due to their limited receptive fields. With the strong generalization ability of Transformers, HTT [14], TOP-ReID [7], and EDITOR [8] achieve superior performance. Specifically, TOP-ReID (A) achieves a mAP of 72.3%, surpassing HTT by 1.2%. However, the high computational complexity of Transformers is unacceptable. Based on VMamba, with only 18.32% of TOP-ReID's parameters, our MambaReID achieves a mAP of 72.2%. Additionally, our MambaReID outperforms most CNN-based and Transformer-based methods across various settings (A/B settings), thereby validating the efficacy of our approach.

Table 1. Comparative analysis of three multi-modal object ReID benchmarks, with the top and second-best performances highlighted in bold and underlined, respectively. The symbol * represents the Transformer-based methods, while † denotes the Mamba-based methods. For the rest of the methods, they are CNN-based methods. The A and B denote the different settings from TOP-ReID [7].

Multi-modal Vehicle ReID. As depicted in Table 1b, we compare our MambaReID with mainstream methods on RGBNT100 and MSVR310. Among single-modal methods, BoT [22] achieves a mAP of 78.0% on RGBNT100. The Transformer-based method TransReID [45] underperforms CNN-based methods due to the lack of convolutional inductive bias. Especially on the small-scale dataset MSVR310, TransReID achieves a mAP of 18.4%, which is 10.5% lower than AGW [1]. However, among multi-modal methods, Transformer-based methods like TOP-ReID [7] and EDITOR [8] exhibit superior performance in integrating multi-modal information. Focusing on mitigating the influence of irrelevant background, EDITOR achieves a mAP of 82.1% on RGBNT100. With only 50.2% of EDITOR's trainable parameters, our MambaReID achieves a competitive mAP of 78.6% on RGBNT100. Additionally, on MSVR310, our MambaReID (B) achieves a mAP of 46.1%, surpassing EDITOR (B) by 7.1%. Overall, these results fully demonstrate the effectiveness of our proposed method.

Ablation Study
Parameter analysis. In Table 3, we compare the trainable parameters of our MambaReID with mainstream methods on RGBNT100. Compared to TOP-ReID [7] and EDITOR [8], our MambaReID achieves a competitive mAP of 78.6% on RGBNT100 with only 59.47 M parameters. Additionally, our MambaReID outperforms most CNN-based methods with a similar number of parameters. This clearly illustrates the efficiency and effectiveness of our proposed approach.

Effect of different backbones. In Table 4, we compare the performance of different backbones on RGBNT201. ViT achieves a mAP of 63.18%, while directly using VMamba achieves a mAP of 55.98%. This result indicates that the final-stage downsampling of VMamba leads to substantial detail loss. With the last-stride trick of BoT [22], VMamba‡ achieves a mAP of 58.47%. However, the high computational complexity of VMamba‡ is unacceptable. Thus, we directly drop the final stage of VMamba and adopt the last-stride technique. With only 63.96% of the parameters and 79.22% of the FLOPs of ViT, the resulting TSV achieves a mAP of 63.81%, surpassing ViT by 0.63%. We therefore employ the TSV as our backbone in subsequent experiments. These outcomes affirm the efficacy of our proposed backbone structure.

Effect of different depths of CVF. In Table 5, we evaluate the performance of different depths of CVF on RGBNT201. As the depth increases, the performance of CVF gradually decreases. This result indicates that a deeper CVF may introduce more irrelevant information, leading to performance degradation. Hence, we opt to use a CVF with a depth of 1 in the other experiments.

Effect of different DMs. In Table 6, we evaluate the performance of different DM settings on RGBNT201. Specifically, "Last" refers to whether the output of TSV corresponds to the output of the last block or the sum of all blocks in the last stage. "Freq" refers to the frequency at which dense connections are introduced in the last stage. When "Freq" is set to 2, the dense connections are introduced every two blocks in the last stage. Comparing model A and model B, integrating the output of all blocks in the last stage introduces more fine-grained details. Additionally, with "Freq" set to 2, model C achieves a mAP of 66.04%, surpassing model A by 2.53%. Finally, model D achieves the best performance with a mAP of 68.07%.

Visualization Analysis
In Figure 3, we visualize the feature distributions of different modules on RGBNT201. In Figure 3a, the original VMamba's extracted features for the same ID are widely dispersed, resulting in poor discrimination. Comparing Figure 3a,b, the introduction of TSV leads to tighter feature grouping within IDs and increased dispersion between different IDs. Additionally, the inclusion of DM results in a more compact feature distribution in Figure 3c. Comparing Figure 3d with Figure 3c, the introduction of CVF increases the spacing between indistinguishable IDs while reducing intra-ID spacing. These visualizations provide strong evidence of the efficacy and superiority of our modules.

Conclusions
In this study, we introduce a novel fusion framework, MambaReID, designed for multi-modal object ReID. We are the first to explore the potential of Mamba in multi-modal object ReID, and we find that its final stage critically disrupts the ReID features. Thus, MambaReID utilizes a Three-Stage VMamba (TSV) architecture to derive robust multi-modal representations that are rich in detail yet require lower computational resources. To enhance its discriminative power, we incorporate a Dense Mamba (DM) module within the TSV to fully exploit various levels of semantic features. Furthermore, leveraging well-aligned multi-modal images, we implement a Consistent VMamba Fusion (CVF) technique aimed at refining the granularity of modal integration. Comprehensive testing across three public benchmarks validates the superior performance and efficiency of our framework.

Figure 1 .
Figure 1. The overall architecture of our MambaReID. First, the images from the RGB, NIR, and TIR modalities are fed into the backbone TSV. With the lightweight TSV, we can extract robust multi-modal representations from different modalities. Then, in the last stage of TSV, DM is utilized to further integrate the information from both shallow and deep features. Finally, with the well-aligned multi-modal images, CVF is employed to enhance the modal aggregation granularity at the same spatial positions. Thanks to the proposed modules, our MambaReID generates more discriminative multi-modal information with low computational complexity.

Figure 2 .
Figure 2. The visualization of the 2D Selective Scan (SS2D) mechanism based on TIR images. From left to right, the SS2D scans the input image in four directions for comprehensive information perception. Different directions of the red arrows represent different scanning directions.
Effect of different components. In Table 2, we conduct ablation studies to evaluate the effectiveness of different components on RGBNT201. Model A is the baseline model with only TSV. With the introduction of CVF, model B achieves a mAP of 68.03%, surpassing model A by 4.22%. Through introducing DM, model C achieves a mAP of 68.07%, surpassing model A by 4.26%. Finally, with both DM and CVF, our MambaReID achieves a mAP of 72.20%, surpassing model B and model C by 4.17% and 4.13%, respectively. As for complexity, model B and model C introduce only a small computational overhead; for model C in particular, the overhead is negligible. These results fully authenticate the effectiveness of our proposed components.

Table 2 .
Comparative performance of different components.

Table 3 .
Comparative number of parameters for different methods. The symbol * represents the Transformer-based methods, while † denotes the Mamba-based methods. For the rest of the methods, they are CNN-based methods.

Table 4 .
Performance comparison with different baselines. The symbol ‡ denotes the backbone's last stride set to 1.

Table 5 .
Comparison of different depths of CVF.

Table 6 .
Comparison of different DM settings.