Joint Modal Alignment and Feature Enhancement for Visible-Infrared Person Re-Identification

Visible-infrared person re-identification aims to solve the matching problem between cross-camera and cross-modal person images. Existing methods strive to perform better cross-modal alignment but often neglect the critical importance of feature enhancement for achieving better performance. Therefore, we propose an effective method that combines both modal alignment and feature enhancement. Specifically, we introduce Visible-Infrared Modal Data Augmentation (VIMDA) for visible images to improve modal alignment. The Margin MMD-ID Loss is also used to further enhance modal alignment and optimize model convergence. We then propose the Multi-Grain Feature Extraction (MGFE) Structure for feature enhancement to further improve recognition performance. Extensive experiments have been carried out on SYSU-MM01 and RegDB. The results indicate that our method outperforms current state-of-the-art methods for visible-infrared person re-identification. Ablation experiments verified the effectiveness of the proposed method.


Introduction
Person re-identification (person re-id) is a technique for retrieving and matching specific pedestrian images across cameras [1,2] and is crucial for intelligent surveillance. Visible-infrared person re-identification is a subtask of person re-id that aims to facilitate cross-modal retrieval and matching between visible and infrared person images. This technology has gained significant attention in the field of intelligent analysis for infrared cameras, which are widely deployed in cities to enhance citizen security. Given its capacity to accurately and efficiently recognize individuals in both day and night settings, visible-infrared person re-id has the potential to significantly enhance public safety.
Traditional person re-id has prioritized addressing challenges such as viewpoint changes, pose changes and occlusions [3][4][5]. However, visible-infrared person re-id presents a distinct challenge that necessitates a shift in focus towards modal alignment. This task is complicated by the significant differences between visible and infrared images, with visible images containing richer color and texture information than infrared images. Existing methods for modal alignment, which aim to reduce the modal discrepancy, can be divided into two categories. The first category focuses on the feature extraction process, using carefully designed network architectures and loss functions to extract modal-aligned features [6][7][8][9][10][11]. However, these methods often suffer from significant design bias and complex network structures, which may limit their transferability. The second category attempts to reduce the modal discrepancy at the pixel level. For example, GAN-based methods [12,13] transform visible images into an intermediate representation between the visible and infrared modals, but at the cost of high computational complexity.

•	The Joint Modal Alignment and Feature Enhancement (MAFE) network for visible-infrared person re-id is proposed to improve modal alignment and feature learning. Experiments on the SYSU-MM01 and RegDB datasets showed that our proposed method achieves state-of-the-art performance.

Visible-Infrared Person Re-Identification
Visible-infrared person re-id is an extension of person re-id that focuses on the problem of re-identification between visible and infrared images. Wu et al. [17] initiated the first attempt by proposing a deep zero-padding network to learn the common features of the two modalities and released the SYSU-MM01 dataset. Since then, visible-infrared person re-id has been widely studied. Most existing visible-infrared person re-id methods focus on how to reduce the modal difference between visible and infrared images and achieve modal alignment. These methods can be broadly classified into representation learning methods, metric learning methods, adversarial network-based methods and auxiliary image-based methods, depending on the pathway used to perform modal alignment. Representation learning methods [8,11] focus on improving the feature extraction process in both modals. Metric learning methods [16,18,19] focus on modifying the loss function during model training to optimize the feature metric space. Both categories attempt to align modals during the feature-learning phase, but often carry heavy artificial design traces and bring complex network structures and calculations, which reduce the transferability of the model. Adversarial network-based methods [12,13,20] and auxiliary image-based methods attempt modal alignment at the pixel level of images. Compared to adversarial network-based methods, which require significant computational resources and introduce additional noise, auxiliary image-based methods aim to generate intermediate images between the two modalities. For example, [14] proposed a data augmentation that converted the original image into a random single-channel replica copied into a three-channel image with 75% probability, or retained the original image with 25% probability, and then fed the augmented image into the model.
This method achieved great performance on visible-infrared person re-id, but it has some problems. When the image conversion probability is high, the generated images are likely to resemble the infrared modality, but the retention probability is then low, which may lead to the loss of color information useful for re-id. Conversely, when the conversion probability is low, the retention probability is high and modal alignment may be poor. Therefore, for this method, it is necessary to find an appropriate balance between image conversion and retention. Additionally, the image transformation operation introduces more diverse training images, increasing the training difficulty of the Metric Loss in re-id and possibly resulting in poor model convergence.

Data Augmentation
Data augmentation methods, such as cropping, rotation, flipping and mix-up [21], use various data transformation operations to obtain diverse data for training the model, thereby improving its performance. In recent years, a number of data augmentation methods for visible-infrared person re-identification have also emerged. For example, one method [14] converted the visible image into a random single-channel copy and then into a three-channel image. Another method, proposed in [22], involved converting the visible image directly into a grayscale image and using it as the input to the model; the method in [23] performed data augmentation by randomly selecting one channel of the RGB image and setting the values of the other channels to zero. Overall, the aim of all these data augmentation methods is to reduce the sensitivity of the model to color information, reduce the difference between visible and infrared images and achieve modal alignment. However, these methods do not test for appropriate image conversion ratios and ignore some of the color information that contributes to the recognition results.

Feature Extraction
With the rapid development of deep learning, feature extraction based on convolutional neural networks and Transformer networks has gradually replaced traditional feature extraction methods. Due to its small model size and fast computation, the convolutional network is more suitable for pedestrian re-identification tasks compared to the Transformer feature extraction structure, which has high computational requirements. The representative of convolutional backbone networks is the Resnet [24] network, which was proposed by He et al. in 2015. Many methods have been developed to improve Resnet, such as ResNeXt [25], which introduces the multi-branch structure of the Inception [26] network into residual blocks to enhance the model's multi-granularity information perception ability, and Resnest [27], which introduces attention mechanisms on this basis to obtain better feature extraction ability. In order to obtain more discriminating features upon the backbones, mining images for more granular information is a common means. For example, in object detection tasks, a common operation to enhance the detection effect of small objects is to continuously fuse shallow features with deep features; for multi-scale object detection, the paper [28] proposes to use multiple dilation rates of dilated convolution to make the model have multi-scale receptive fields and improve the model's perception ability to different granularity information. This paper explores the multi-grain information in visible and infrared images from the perspective of feature extraction to improve the re-id performance.

Method
In this section, we first provide an overview of the overall approach of our method in Section 3.1. Then, in Section 3.2, we introduce the Modal Alignment Operations (MAO) used in this paper, which includes the Visible-Infrared Modal Data Augmentation (VIMDA) for image preprocessing and the Margin MMD-ID Loss for training. Next, in Section 3.3, we describe the Multi-Grain Feature Extraction (MGFE) Structure proposed in this work. Finally, in Section 3.4, we present all the loss functions that make up our model.

Overview
We used Resnest50 [27], pre-trained on ImageNet, as the backbone, as is shown in Figure 1. Resnest50 is a superior backbone network for feature extraction compared to Resnet50, and its multi-branch and multi-receptive-field structure can effectively extract multi-grain information. Additionally, following AGW [1], we incorporated the Non-Local [29] attention mechanism into Layer 2 and Layer 3 of Resnest50 to enhance the network's spatial relationship modeling capabilities. Specifically, a Non-Local attention block was embedded after each of the last two Blocks in Layer 2 of Resnest50 and after each of the last three Blocks in Layer 3 of Resnest50, as can be seen in Figure 2.
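To make the Non-Local operation concrete, the following is a minimal NumPy sketch of the embedded-Gaussian Non-Local block. The 1 × 1 convolutions of the real block are stood in for by random projection matrices, so this illustrates only the shapes and the attention arithmetic, not trained behavior; the bottleneck width and weight scale are our assumptions.

```python
import numpy as np

def non_local_block(x, rng=None):
    """Minimal embedded-Gaussian Non-Local operation [29] in NumPy.

    x: feature map of shape (C, H, W). The 1x1 convolutions of the real
    block are modeled as small random projection matrices for brevity;
    the shapes and the residual form match the original design.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    c, h, w = x.shape
    ci = max(c // 2, 1)                           # bottleneck channels
    theta, phi, g = (rng.standard_normal((ci, c)) * 0.01 for _ in range(3))
    w_out = rng.standard_normal((c, ci)) * 0.01   # output projection
    flat = x.reshape(c, h * w)                    # (C, N), N = H*W positions
    q, kmat, v = theta @ flat, phi @ flat, g @ flat   # each (Ci, N)
    att = q.T @ kmat                              # (N, N) pairwise similarities
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)         # softmax over positions
    y = v @ att.T                                 # (Ci, N) aggregated context
    return x + (w_out @ y).reshape(c, h, w)       # residual connection
```

Because the block ends with a residual connection, it can be inserted after any backbone Block without changing feature-map shapes, which is how AGW-style embedding works.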

Our proposed model is a dual-stream structure based on Resnest50, as shown in Figure 1. The initial Stem Layer, which includes the initial 7 × 7 convolutional layer, BN layer, ReLu layer and Max-pooling layer, is parameterized independently, while the other parts share weight parameters.
During the feature extraction stage, visible images are first input into the network, followed by the application of Visible-Infrared Modal Data Augmentation (VIMDA) and then the backbone network's Stem. The Layer1-3 layers are used to extract preliminary feature maps. Next, the Multi-Grain Feature Extraction (MGFE) Structure is applied to enhance the multi-grain information components in the feature map. Finally, Layer 4 is used to extract features to generate the final visible feature map. The feature extraction process for infrared images does not involve VIMDA but is otherwise consistent with the feature extraction process for visible images. The feature extraction for visible and infrared images can be seen in Figure 1 and Equation (1).
where X_vis and X_inf, respectively, represent the visible image and the infrared image, and F^(C×H×W) denotes the extracted feature map with C channels and spatial size H × W.

During the model training phase, all feature maps are first subjected to GeM [30] pooling to output feature vectors. Then, the Metric Loss (including the Hc-Tri Loss [19] and the Enhanced WRT Loss [14]) and the Margin MMD-ID Loss [16] between the visible and infrared feature vectors are calculated. After that, the feature vectors are normalized using Batch Normalization (BN) and mapped through fully connected layers to compute the ID Loss (Cross-Entropy Loss). During the model inference phase, the BN-normalized feature vectors are used, as is shown in Figure 1.
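For reference, GeM pooling interpolates between average and max pooling through an exponent p; a minimal NumPy sketch, with p fixed rather than learned (an assumption for brevity):

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the spatial dimensions.

    feature_map: array of shape (C, H, W), assumed non-negative
    (e.g. post-ReLU activations). p = 1 recovers average pooling,
    and larger p approaches max pooling.
    """
    x = np.clip(feature_map, eps, None)           # avoid 0 ** (1/p) issues
    pooled = np.mean(x ** p, axis=(1, 2)) ** (1.0 / p)
    return pooled                                  # feature vector, shape (C,)
```

In the real model p is a learnable parameter optimized jointly with the network; here it is simply an argument.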

Modal Alignment Operations (MAO)
In the proposed method, we first propose a new data augmentation method called Visible-Infrared Modal Data Augmentation (VIMDA) to reduce the modal difference. VIMDA incorporates a grayscale conversion operation into the previous CAJ method [14], since grayscale images have a similar appearance to single-channel images and integrate all RGB color channels. The specific implementation process of VIMDA is shown in Figure 3a. Given an RGB image as input, the original image may be randomly modified or replaced based on fixed probabilities. Specifically, the original image has a probability of "a%" of being preserved, a probability of "b%" of being replaced with a grayscale image and a probability of "c%" of being replaced with a new three-channel image generated by replicating a random single channel. The probability ratio of a, b and c is determined through well-designed experiments, and the sum of all probabilities is equal to one, as shown in Equation (2).

In general, the difference between VIMDA and the data augmentation in the CAJ method [14] is that VIMDA introduces an additional grayscale conversion operation, and the probability ratios in VIMDA are carefully designed through experiments, as shown in Figure 3. Due to the use of VIMDA, the input data of the visible branch in the proposed model have changed from a single RGB image to a diverse and complex input. During the training process, we found that this increase in input diversity for the visible branch made the convergence of the Metric Loss more difficult and even occasionally caused non-convergence. Therefore, this paper introduces the Margin MMD-ID Loss [16] to constrain the distance between the output distributions of the visible branch and the infrared branch, achieving modal alignment and helping the model converge, as is shown in Figure 1. The formula for the Margin MMD-ID Loss is shown in Equation (3).
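A minimal sketch of the VIMDA operation described above, assuming NumPy (H, W, 3) uint8 images; the grayscale weights are the common ITU-R BT.601 coefficients, which the paper does not specify, and the 4:3:3 default ratio is the tuned setting reported in the experiments:

```python
import numpy as np

def vimda(img, p_keep=0.4, p_gray=0.3, p_single=0.3, rng=None):
    """Sketch of Visible-Infrared Modal Data Augmentation (VIMDA).

    img: RGB image as a (H, W, 3) uint8 array. With probability p_keep the
    image is kept unchanged, with p_gray it is replaced by its grayscale
    version replicated to three channels, and with p_single a randomly
    chosen channel is replicated to three channels.
    """
    if rng is None:
        rng = np.random.default_rng()
    assert abs(p_keep + p_gray + p_single - 1.0) < 1e-9
    u = rng.random()
    if u < p_keep:
        return img
    if u < p_keep + p_gray:
        # ITU-R BT.601 luminance weights (an assumption; the paper does
        # not state which grayscale conversion it uses)
        gray = (0.299 * img[..., 0] + 0.587 * img[..., 1]
                + 0.114 * img[..., 2]).astype(img.dtype)
        return np.stack([gray, gray, gray], axis=-1)
    ch = int(rng.integers(0, 3))                  # pick R, G or B
    return np.stack([img[..., ch]] * 3, axis=-1)  # replicate to 3 channels
```

In training, this would be applied to each visible image before the backbone's Stem; infrared images bypass it entirely.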

Multi-Grain Feature Extraction (MGFE) Structure
In order to improve the representation learning ability of the model, this paper proposes the Multi-Grain Feature Extraction (MGFE) Structure and embeds it between Layer 3 and Layer 4 of the backbone. MGFE contains two parts: the Dilated Encoder (DE) module [28] and the Cross-Layer Feature Fusion (CLFF), as shown in Figure 4.

The DE module contains two parts, the Projector and the Residual Blocks, as shown in Figure 5. The Residual Blocks are obtained by cascading four residual blocks, each of which contains a dilated convolution with a different dilation rate (2, 4, 6, 8). The use of cascaded dilated convolutions with different dilation rates gives the model a more multi-scale receptive field and improves its ability to output features containing multi-grain information. In MGFE, the DE module is embedded behind Layer 3 in the form of a residual connection, as is shown in Figure 4.

The CLFF fuses the output of Layer 1 with the residual output of the DE module. Specifically, the output of Layer 1 is down-sampled with a 1 × 1 convolutional layer with a stride of 4, concatenated with the residual output of the DE module in the channel dimension and finally dimensionally transformed with a 1 × 1 convolutional layer before being fed into Layer 4 for further information extraction and integration.
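The multi-scale effect of the cascaded dilated convolutions can be checked with a little receptive-field arithmetic: a k × k convolution with dilation d spans d(k − 1) + 1 positions, so at unit stride it grows the receptive field by d(k − 1). A short sketch of this calculation:

```python
def receptive_field(kernel_sizes, dilations, strides=None):
    """Receptive field of a stack of convolutions (unit stride by default).

    Each k x k convolution with dilation d has an effective kernel extent
    of d * (k - 1) + 1; with stride s the per-layer growth is scaled by
    the cumulative stride ("jump") of the preceding layers.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, d, s in zip(kernel_sizes, dilations, strides):
        rf += d * (k - 1) * jump
        jump *= s
    return rf

# Four cascaded 3x3 dilated convolutions with rates 2, 4, 6, 8, as in the
# DE module's Residual Blocks (3x3 kernels are an assumption, following
# the Dilated Encoder design in [28]):
print(receptive_field([3, 3, 3, 3], [2, 4, 6, 8]))  # 41
```

The 41 × 41 receptive field on the Layer 3 feature map is far larger than the 9 × 9 of four plain 3 × 3 convolutions, which is the "multi-scale perceptual field" the text refers to.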

Loss Function
In addition to using the Margin MMD-ID Loss to constrain the distance between the visible features distribution and the infrared features distribution after Gem pooling, the Metric Loss, including Hc-Tri Loss [19] and the Enhanced WRT Loss [14] as is shown in Equations (4) and (5), is also applied to all pooled features to further optimize the relative distances between features and feature centers in the metric space. Simultaneously, the Cross-Entropy Loss is calculated as the ID Loss for the features following the fully connected layer. Finally, the total loss function during model training phase is shown in Figure 6 and Equation (6).
where c_i^v and c_i^t, respectively, represent the feature centroids of the visible images and the infrared images of person i, and c_j^n represents the centroid of all visible and infrared image features of any person j other than person i. P represents the number of persons, K represents the number of visible images or infrared images per person during training, v_i and t_i represent the visible features and infrared features of person i and ρ represents the margin hyper-parameter of the Hc-Tri Loss.
where i represents any sample, j represents any positive sample of i and k represents any negative sample of i.

L_Total = L_Enhanced WRT + L_Cross-Entropy + λ_1 L_Margin MMD-ID + λ_2 L_Hc-Tri (6)

where L_Total denotes the total loss function of the proposed model, and λ_1 and λ_2 denote the weight hyper-parameters of the Margin MMD-ID Loss and the Hc-Tri Loss.
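For illustration, the Margin MMD-ID term builds on the maximum mean discrepancy between the two branches' feature distributions. Below is a NumPy sketch of the plain RBF-kernel MMD estimator together with the weighted sum of Equation (6); the margin mechanism of [16] is deliberately omitted, and the helper names and kernel bandwidth are our assumptions:

```python
import numpy as np

def mmd2_rbf(x, y, sigma=1.0):
    """Biased empirical estimate of squared MMD with an RBF kernel.

    x: (n, d) visible feature vectors; y: (m, d) infrared feature vectors.
    The Margin MMD-ID Loss [16] builds on this quantity but additionally
    applies a margin so the two distributions are only pulled together
    to within a fixed distance; that part is omitted here.
    """
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def total_loss(l_wrt, l_ce, l_mmd, l_hctri, lam1=0.12, lam2=0.05):
    """Weighted total of Equation (6); the default weights are the
    values reported in the experimental settings."""
    return l_wrt + l_ce + lam1 * l_mmd + lam2 * l_hctri
```

When the visible and infrared feature distributions coincide, the MMD estimate drops to zero, so minimizing it (up to the margin in [16]) directly drives modal alignment.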
Datasets

SYSU-MM01 was obtained from 2 indoor visible cameras, 2 outdoor visible cameras and 2 near-infrared cameras. For the training phase, 22,258 visible images and 11,909 infrared images from 395 persons were used. Following [17], the testing phase involved two modes: All Search mode and Indoor Search mode. In All Search mode, we utilized 3803 infrared images of 96 pedestrians from the test set as queries and selected 301 images at random from all visible images in the test set as the gallery for recognition. In Indoor Search mode, we used 3803 infrared images of the same 96 pedestrians from the test set as queries and selected 112 images at random from the indoor visible cameras in the test set as the gallery for recognition.
The RegDB dataset comprised 412 pedestrians, captured using a visible light camera and a thermal imaging camera. Each pedestrian had 10 visible images and 10 infrared images. The dataset was divided into a training set and a test set, with all images of 206 pedestrians used for training and the remaining 206 pedestrians used for testing.

Evaluation Metric
This paper evaluates the performance of the proposed model using the Cumulative Match Characteristic (CMC) and mean Average Precision (mAP) metrics.
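For reference, both metrics can be computed from a query-gallery distance matrix; the NumPy sketch below is simplified in that it omits the camera-aware gallery filtering that the SYSU-MM01 protocol additionally applies:

```python
import numpy as np

def cmc_map(dist, query_ids, gallery_ids, topk=10):
    """Simplified CMC curve and mAP from a distance matrix.

    dist: (num_query, num_gallery) pairwise distances.
    Returns (cmc, mAP), where cmc[r] is the fraction of queries with a
    correct match within the first r+1 ranks.
    """
    order = np.argsort(dist, axis=1)                  # nearest first
    matches = gallery_ids[order] == query_ids[:, None]
    cmc = np.zeros(topk)
    aps = []
    for m in matches:
        hits = np.flatnonzero(m)                      # ranks of correct hits
        if hits.size == 0:
            continue                                  # no valid match
        if hits[0] < topk:
            cmc[hits[0]:] += 1                        # first correct rank
        precision = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision.mean())                  # average precision
    n = len(aps)
    return cmc / n, float(np.mean(aps))
```

Rank-1 accuracy, as reported in the experiments, is simply `cmc[0]`.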

Experimental Settings

Basic Settings
The experiments were carried out with PyTorch on a single NVIDIA TITAN V GPU. During training, in addition to using VIMDA for visible images, cropping, random erasing and random horizontal flipping were performed on all visible and infrared images. During model inference, no data augmentation was performed on the input image. In VIMDA, the probability of retaining the visible image was 40%, the probability of converting to grayscale was 30% and the probability of converting to a new three-channel image copied from a random single channel was 30%. The stride of the down-sampling convolution layer between Layer 3 and Layer 4 was set to 1. The margin hyper-parameters of the Margin MMD-ID Loss and the Hc-Tri Loss were 1.7 and 4, and the weight hyper-parameters of the Margin MMD-ID Loss and the Hc-Tri Loss in the total loss function were 0.12 and 0.05. We used stochastic gradient descent (SGD) and a dynamic learning rate to update the model, as shown in Equation (7). The number of training epochs was 100; each training batch contained 8 people, and each person had 4 visible images and 4 infrared images.
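The batch composition described above (8 identities, 4 visible and 4 infrared images each, 64 images in total) can be sketched as a simple identity-balanced sampler. The dictionary layout and the sampling-with-replacement convention for persons with few images are our assumptions:

```python
import random

def pk_batch(ids_to_vis, ids_to_inf, p=8, k=4, rng=None):
    """Sample one training batch: p identities, k visible + k infrared
    images each (p=8, k=4 gives the paper's 64-image batches).

    ids_to_vis / ids_to_inf: dict mapping person id -> list of image paths.
    """
    if rng is None:
        rng = random.Random()
    persons = rng.sample(sorted(ids_to_vis), p)       # p distinct identities
    batch = []
    for pid in persons:
        # choices() samples with replacement, so persons with fewer than
        # k images still contribute k samples (a common convention)
        batch += [(pid, 'vis', im) for im in rng.choices(ids_to_vis[pid], k=k)]
        batch += [(pid, 'inf', im) for im in rng.choices(ids_to_inf[pid], k=k)]
    return batch
```

This balanced structure is what the Hc-Tri Loss relies on: every identity in the batch has enough samples in both modalities to form per-identity, per-modality centroids.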

Test Settings
For the SYSU-MM01 dataset, All Search mode and Indoor Search mode were adopted to test the performance of the proposed method. Specifically, we conducted 10 random tests and computed the average CMC and mAP scores across the 10 runs as the overall performance of the model. For the RegDB dataset, the test set consisted of two modes: retrieving infrared images with visible queries (Visible to Infrared) and retrieving visible images with infrared queries (Infrared to Visible). As the RegDB dataset is small, we divided both the training and test sets into 10 subsets and performed 10-fold cross-validation to obtain the final CMC and mAP scores of the model.
As shown in the table, in the All Search test mode of the SYSU-MM01 dataset, our proposed method achieved Rank1 and mAP of 74.85% and 71.7%, respectively, which are 3.17% and 2.91% higher than the best MAUM model. In the Indoor Search test mode of the SYSU-MM01 dataset, our proposed method achieved Rank1 and mAP of 80.85% and 84.06%, respectively, which are 3.08% and 2.21% higher than the best MAUM model. In the Visible to Infrared test mode of the RegDB dataset, our proposed method achieved Rank1 and mAP values of 91.17% and 81.81%, respectively, with a 3.3% increase in Rank1 compared to the MAUM model but a decrease in mAP. In the Infrared to Visible test mode of the RegDB dataset, our proposed method achieved Rank1 and mAP values of 90.05% and 80.27%, respectively, with a 3.55% increase in Rank1 compared to the MAUM model but a 4.07% decrease in mAP. Upon comparing the results, it is evident that our proposed method MAFE outperforms other existing methods on the SYSU-MM01 dataset; on the RegDB dataset, MAFE achieves a significantly higher Rank1 value than other existing methods, although the mAP value falls short of the MAUM method. These results attest to the effectiveness and superiority of MAFE.

Ablation Study

Ablation Experiments of Each Component
We conducted ablation experiments on the All Search mode of SYSU-MM01 to evaluate the impact of the MAO (including VIMDA and the Margin MMD-ID Loss) and the MGFE (including CLFF and DE) on re-id performance. The results are presented in Table 2. The findings indicate that each component in our proposed method effectively improves the recognition performance. With all components loaded, Rank1 and mAP increased significantly by 6.15% and 6%, respectively, compared to Experiment 1. Our experimental results demonstrate the effectiveness of each component in our proposed method for visible-infrared person re-identification.

Analysis of the Probability Ratio in VIMDA
In VIMDA, if too much of the original visible image is retained, modal alignment will not be achieved; conversely, if too little is retained, color information that may be beneficial to person re-identification is lost. Therefore, this paper seeks the optimal probability ratio by fine-tuning the transformation probabilities in VIMDA. The experiment was conducted on the All Search mode of the SYSU-MM01 dataset, and the experimental content as well as the experimental results are shown in Figure 7. As can be seen from the figure, the re-id performance of the model gradually decreased as the proportion of retained visible images increased, and the model reached an optimal re-id performance of Rank1 74.85% and mAP 71.7% when the probability ratio between retaining the visible image, converting to a grayscale image and converting a random single channel replicated into a three-channel image was 4:3:3.

Ablation Experiments of Cross-Layer Feature Fusion (CLFF)
In the All Search mode of the SYSU-MM01 dataset, experiments were conducted on the influence of different layer feature fusion in the CLFF of MGFE. The experimental content and results are shown in Table 3, where the numbers denote the output of the corresponding layer and the R-O denotes the residual output of the DE module. The table clearly indicates that the cross-layer fusion of the output from Layer 1 and the residual output from the DE module resulted in the best re-id performance.

Figure 7. Experimental graph for the probability ratio in VIMDA; the abscissa represents the probability ratio between preserving the original visible image, converting to a grayscale image and converting a random single channel replicated into a three-channel image.

We built two-stream networks using different backbones (Resnet50, Res2Net50 [38] and Resnest50) to test the effectiveness and robustness of the proposed method (MAO and MGFE), as shown in Table 4. The results presented in Table 4 provide strong evidence that the proposed method significantly improves visible-infrared re-id performance across various backbone networks. In particular, the Resnest50 backbone with MAO and MGFE added yielded the highest re-id accuracy, with Rank1 of 74.85% and mAP of 71.7%. These findings demonstrate the effectiveness and robustness of the method proposed in this paper.

Heat Map Analysis of MGFE
We used Grad-Cam [39] to visually analyze the Layer 4 features extracted from the network model with or without MGFE. As can be seen from Figure 8, the embedding of MGFE, on the one hand, enabled the model to better focus on the person itself and distinguish the background; on the other hand, it enabled the model to focus on more identifiable features of different scales in different parts of the person image. The visual heat map clearly illustrates that MGFE significantly enhanced the model's feature-learning capability and improved the performance of visible-infrared person re-id.
Figure 8. Grad-Cam visual analysis of the Layer 4 features extracted from the model with or without MGFE. The images are shown in groups of three: the left image in each group is an infrared or visible image, the middle image is the heat map with MGFE and the right image is the heat map without MGFE.
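The Grad-Cam weighting behind such heat maps can be sketched as follows. This is an illustrative NumPy implementation, not the paper's code: it assumes the Layer 4 activations and their gradients with respect to the identity score have already been captured (e.g. via forward/backward hooks on the backbone).

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat map from Layer 4 activations and their gradients.

    activations, gradients: C x H x W arrays for one image. Each channel
    is weighted by its spatially averaged gradient (alpha_k), the weighted
    channels are summed, passed through ReLU, and normalised to [0, 1]
    for overlaying on the input image.
    """
    weights = gradients.mean(axis=(1, 2))                      # alpha_k, shape C
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0)
    if cam.max() > 0:
        cam = cam / cam.max()                                  # scale to [0, 1]
    return cam                                                 # H x W heat map
```

Regions where the map is close to 1 are the spatial locations Layer 4 relied on most, which is what the with/without-MGFE comparison in Figure 8 contrasts.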

Output of Visible-Infrared Person Re-ID Results
In the SYSU-MM01 and RegDB test sets, we selected several infrared images of individuals as the Query and used the method proposed in this paper to match the visible person images in the corresponding Gallery set. We then visualized the results, as shown in Figures 9 and 10. The number displayed above each output result represents its similarity ranking with the Query; the lower the number, the higher the similarity between the images. A green box indicates a correctly identified sample, while a red box indicates an incorrectly identified sample.

The re-identification results on the SYSU-MM01 dataset in Figure 9 demonstrate that the proposed method can accurately retrieve all correct samples for simple cases, while for more challenging persons it can still identify them up to Rank 5. For the RegDB dataset, as shown in Figure 10, our model can effectively retrieve all easy samples, but due to the poor quality of the infrared images in the RegDB dataset and the lack of detailed information, it is difficult for the model to accurately distinguish and recognize some of the easily confused samples.
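The ranking procedure behind Figures 9 and 10 can be sketched as below. The function name `rank_gallery` and the use of cosine similarity are illustrative assumptions; given a query feature and gallery features from the trained model, it sorts the gallery by similarity and marks which ranked samples share the query identity (the green vs. red boxes).

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats, gallery_ids, query_id):
    """Rank gallery images by cosine similarity to an infrared query.

    query_feat: D-dim feature vector; gallery_feats: N x D matrix of
    visible-image features; gallery_ids: N identity labels. Returns the
    gallery indices sorted by descending similarity and a boolean mask
    marking which ranked samples match the query identity.
    """
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                       # cosine similarity to the query
    order = np.argsort(-sims)          # rank 1 first (highest similarity)
    correct = gallery_ids[order] == query_id
    return order, correct
```

Visualizing the top-ranked gallery images with this mask reproduces the green-box/red-box layout of the output figures.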

Next Challenges and Future Perspective
For visible-infrared person re-id, improving the model's ability to extract information common to images captured under the two modalities and to generate more discriminative features is a crucial challenge that restricts further gains in accuracy. From the results in this paper, we can see that it is worth studying how to improve recognition on cross-modal datasets that lack detailed information, such as the RegDB dataset. Additionally, enabling convolutional models to capture long-range dependencies, as the Transformer does, is also a challenging task.

Replicability of the Method
The details and parameters of the proposed method are presented in Section 3 (Method) and Section 4.2 (Experimental Settings). By following these details and parameters, the results of this paper can be fully replicated.

Conclusions
This paper proposes the Joint Modal Alignment and Feature Enhancement (MAFE) network for visible-infrared person re-id, which focuses on both modal alignment and feature enhancement. First, it proposes VIMDA and introduces the Margin MMD-ID Loss for modal alignment. Then, the Multi-Grain Feature Extraction (MGFE) Structure is proposed and embedded to extract features that contain multi-grain information and are more discriminative. Extensive experiments were carried out on the SYSU-MM01 and RegDB datasets, and the results fully demonstrate the effectiveness of the proposed method.

Data Availability Statement:
The data that support the findings of this study are available online. These datasets were derived from the following public resources: [SYSU-MM01, RegDB].