Neighboring-Part Dependency Mining and Feature Fusion Network for Person Re-Identification

Person re-identification (Re-ID) is a computer vision technique used to determine the presence of a specific pedestrian target in an image or video sequence. It is an important branch of image retrieval. With the advancements in deep learning, notable progress has been achieved in Re-ID research. However, existing methods primarily focus on the most prominent features in the image, ignoring other less obvious yet beneficial features and spatial interdependencies within the image. To address this issue, this paper proposes a neighboring-part dependency mining and feature fusion network (NDMF-Net). The network horizontally splits pedestrian features into multiple parts, using a part-level hybrid attention module (PHAM) to focus on the salient region of each part, and a neighboring-part dependency exploration module (NDEM) to extract the spatial dependencies between neighboring parts of the image. Eventually, different features are fused to form the final representation. We validate the NDMF-Net on mainstream datasets and the experimental results demonstrate that our method is effective and achieves state-of-the-art performance.


I. INTRODUCTION
Person re-identification (Re-ID) is a typical sub-task of image retrieval, which uses computer vision, pattern recognition, machine learning, and other technologies to determine whether there is a specific person in the image or video sequence [1]. Person Re-ID has various applications in criminal investigation, video surveillance, intelligent commerce, and other fields [2], [3]. In recent years, it has received extensive attention from both industry and academia.
Before the emergence of deep learning in the field of person Re-ID, traditional methods based on hand-crafted features were mainstream. These methods rely on manually designed features such as color, texture, and shape. They are not only inefficient but also lack robustness; in particular, when dealing with complex and variable scenes, it is difficult for them to extract sufficiently discriminative features [4], [5], [6]. In recent years, with the advancement of computer vision technology, deep learning-based person Re-ID has developed rapidly and achieved good performance [7], [8], [9], [10]. Compared with traditional methods, deep learning-based person Re-ID methods can not only be trained end-to-end but also adaptively extract useful features from images, avoiding the subjectivity inherent in hand-crafted features. Furthermore, by jointly learning different features, the performance and robustness of the network have been greatly improved, surpassing those of traditional methods. Additionally, the current data explosion provides a large number of samples, and deep learning-based methods can exploit the correlations among these samples, further improving the performance of person Re-ID. Nevertheless, in practical applications, posture changes, background interference, and partial occlusion still pose great challenges for feature extraction and recognition.
Convolutional neural networks (CNNs) are the most widely used deep learning networks in image processing, particularly for feature extraction in person Re-ID tasks. The prevalent features employed in person Re-ID tasks include global features and local features. Global features are easy to recognize and represent, but they only focus on surface clues of the image and ignore specific information from local areas. This makes them unsuitable for identifying complex samples with background interference or partial occlusion [11]. On the other hand, local features provide a detailed account of specific regions in an image, encompassing richer and more refined features that can serve as a supplement to the limitations of global features. The current extensively applied approach for enhancing the performance and robustness of person Re-ID involves the joint extraction of global and local features to form the final representation features [12]. Typically, local features are extracted by using two methods: (1) utilizing attention mechanisms for obtaining information from prominent regions of the entire image as local features [13], [14], or (2) segmenting the image into different parts and using the global features of each part as local features [15], [16]. These two methods correspond to the attention-based and part-based branches of person Re-ID, respectively.
While both methods can extract richer features to some extent, they also possess certain drawbacks. (1) Attention-based methods typically only focus on the most prominent channels or regions, leading to the omission of other key information that may not be apparent at the overall pedestrian level but can contribute to identification. (2) Part-based methods not only fail to fully consider the relationships between different body parts, but also introduce background interference in parts when using global max pooling and global average pooling to extract the global features of parts as local features of the image. These factors may lead to the omission of some useful features and reduce the expressiveness of the extracted features. Therefore, to obtain more comprehensive discriminative features, it is necessary to fully consider the spatial dependencies between different parts of pedestrians while minimizing the interference from background information. Fig. 1 provides some intuitive representations of this concept.
To achieve this objective, we propose an NDMF-Net, which consists of a backbone network, two part-level hybrid attention modules (PHAM), and two neighboring-part dependency exploration modules (NDEM). The PHAM includes a part-level channel attention module and a spatial attention module. The network considers the interdependencies between neighboring parts of the pedestrian and employs a specially designed attention module to focus on salient channels and regions of different parts of the image. Specifically, the network first horizontally segments the pedestrian features extracted from the middle layer of the backbone network into several parts, then uses the PHAM to focus on salient regions in the feature maps to extract discriminative fine-grained features. Following that, the NDEM is employed to investigate the spatial dependencies between neighboring parts of the features. Finally, various features are fused into the final representation and sent to the classification layer for classification.
The main contributions of this work are as follows:
• We propose an end-to-end NDMF-Net, which can extract dependencies between neighboring parts in pedestrian images to improve the model's feature representation and recognition ability.
• We design a PHAM and an NDEM. The PHAM can guide the network to focus on salient regions at different positions in pedestrian images and extract corresponding global and local features. The NDEM can extract spatial dependencies between neighboring parts, enriching the semantic continuity of features.
• We evaluate our method on three mainstream datasets, i.e., Market-1501, DukeMTMC-ReID, and MSMT17, and the results show that our method achieves state-of-the-art performance.
This paper is organized as follows. Section II introduces related work. Section III provides a detailed description of the proposed NDMF-Net. Section IV presents the details and results of the experiments. Finally, Section V concludes the paper and discusses prospects.

II. RELATED WORK
In recent years, with the advancement of deep learning, researchers have proposed various deep learning-based methods to address the problem of person Re-ID and achieved remarkable performance. In the following sections, we will briefly introduce the related works on person Re-ID.
A. PART-BASED Re-ID
Current CNN-based person Re-ID models have achieved impressive results, especially when part-based feature extraction techniques are employed. These models have shown notable enhancements in feature representation and recognition, as verified by various studies [14], [17], [18], [19], [20]. To learn discriminative part features, researchers have proposed various part-based CNNs. The most common approach is to divide the input features into several parts according to certain rules and then learn the features of each part separately. Reference [17] first proposed a strong part-based convolutional baseline (PCB), which divides the feature map into six horizontal stripes and extracts the features of each stripe by global average pooling, resulting in a significant performance boost. Part misalignment is a problem that comes with part-based CNNs. To address this issue, [18] proposed a dynamic alignment method that uses dynamic programming to find the shortest path between local features for alignment and matching, leading to a significant improvement in performance. To extract individual part features, that work horizontally partitions the features and then globally pools each part in the horizontal direction. Reference [19] proposed a spatial feature reconstruction method with the goal of achieving an alignment-free system. To overcome the obstacles posed by local misalignment, a multi-scale block representation is proposed, where features at different scales are obtained via average pooling. Reference [20] proposed a multi-branch network that enhances feature representation by fusing global and local features; local features are extracted by horizontally segmenting the input features into multiple strips, each of which is treated as a local feature after a global max pooling operation.
The aforementioned studies indicate that current part-based person Re-ID methods primarily enhance feature richness by jointly learning features of the whole body and body parts, thereby improving the performance of the network. However, the relationships between different body parts are not taken into full consideration, leading to a loss in feature expressiveness. To address this issue, an NDEM is constructed to extract the dependencies between neighboring parts, as depicted in Fig. 1(c). This internal dependency attribute based on single pedestrians is largely immune to misalignment and can further enrich the representation of pedestrian features.

B. ATTENTION-BASED METHODS
Attention mechanisms can adaptively find salient regions in complex scenes by simulating the human visual system. Many scholars have introduced attention mechanisms into person Re-ID and achieved remarkable results, making attention-based person Re-ID an important branch of this field. In recent times, researchers have proposed various attention models and enhanced the feature representation by collectively using multiple attention mechanisms to extract features at different levels.
Hu et al. [21] introduced the concept of channel attention and presented SE-Net. Unlike conventional methods that pass weights unchanged to the subsequent layer, SE-Net establishes relationships among different channels and readjusts their weights based on their inter-correlations, ultimately exhibiting strong generalization ability. Subsequently, this idea was adopted in [22] and achieved remarkable performance. Chen et al. [23] proposed High-Order Attention (HOA), which models and exploits complex higher-order statistical information to differentiate pedestrian images. Cai et al. [24] proposed a multi-scale body part mask-guided attention network (MMGA), which employs body part masks to steer the corresponding attention training; however, extracting an accurate mask remains a problem. Reference [25] designed a feature-pooling module that collects features from foreground, background, and spatial attention maps to generate a global descriptor. To extract more comprehensive local features, [26], [27] proposed a novel local attention model that utilizes the Squeeze-and-Excitation module to choose the most prominent channel features of the entire image as body-part features. However, this approach may disregard features that are not globally prominent yet helpful for identification.
The abovementioned attention-based methods typically focus on channels or regions that contain the most salient features. Nevertheless, the importance of these channels or regions may differ across different parts of the body, potentially leading to the omission of other essential information that may not be obvious at a general level but is helpful for identification. Additionally, segmenting the image into several distinct parts and treating the global features of each part as local features cannot prevent the interference of background information in these parts [15], [16]. To this end, we design a part-level hybrid attention module that adaptively selects important channels and regions for different body parts to extract as many features as possible, avoiding the omission of valuable features that may not be salient at the overall pedestrian level. Furthermore, all the features are filtered thoroughly at an overall level to retain the truly effective features, while minimizing the interference of background information in each body part. This is illustrated in Fig. 1(d).

III. PROPOSED METHOD
The proposed NDMF-Net aims to establish the internal dependencies of neighboring body parts by employing location information in a unified person Re-ID framework. The overall structure of the NDMF-Net, as shown in Fig. 2, consists of a backbone network, two PHAMs, and two NDEMs. The PHAM consists of a part-level channel attention module (PCAM) and a spatial attention module (SAM). In this paper, we use ResNet-50, pre-trained on ImageNet, as the backbone network for extracting global features from pedestrian images. The PHAMs, embedded after Layer 1 and Layer 3, effectively suppress irrelevant noise in pedestrian images and enhance the expression of salient regions. The NDEMs are used to leverage the correlation of the human spatial structure and extract dependencies between neighboring parts. Finally, the features extracted by the different modules are fused to form the final discriminative features. A detailed introduction of the individual modules is provided in the following subsections.

A. BACKBONE
In this paper, we adopt ResNet-50, which is widely used in person Re-ID tasks and pre-trained on ImageNet, as the backbone network. ResNet-50 is a deep residual network proposed by He et al. [28], which can effectively alleviate the vanishing-gradient problem in deep networks.
In our implementation, we make two small changes to ResNet-50. First, we set the stride of the conv2 layer and the downsample layer in Layer 4 to 1. Second, during training, we set the out_features of the fc layer to the number of pedestrian IDs in the corresponding dataset.

B. PART-LEVEL HYBRID ATTENTION MODULE
The PHAM comprises two modules: a part-level channel attention module (PCAM) and a spatial attention module (SAM). The PHAM aims to emphasize significant channels and regions in the extracted features. To achieve this, the PCAM uses a group of channel attention branches that adaptively detect the key channels of each part. The SAM then globally adjusts and weights the features produced by the PCAM to extract genuinely useful features, while simultaneously reducing the influence of extraneous information from surrounding background areas.

1) PART-LEVEL CHANNEL ATTENTION MODULE
The channel attention mechanism in a neural network is an additional network capable of explicitly modeling the correlation between different channels. This mechanism dynamically adjusts the weights of different channels via learning, increasing the weight of crucial channels and decreasing the weight of unimportant ones. This approach enhances significant features and suppresses unimportant ones. In the person Re-ID task, considering that features within the same channel are of different importance to different body parts of the pedestrian, we design a PCAM. The overall architecture of the PCAM is presented in Fig. 3.
As shown in Fig. 3, the proposed PCAM consists of a group of channel attention branches, each of which consists of two pooling layers, namely global average pooling (GAP) and global max pooling (GMP), a fully connected (FC) layer, and a sigmoid function. For a given feature map X ∈ R^(C×H×W), where C is the number of channels, H is the height, and W is the width, X is divided along the height dimension into N parts, each of which can be expressed as X_i ∈ R^(C×(H/N)×W) (i = 1, 2, . . . , N). Feeding X_i into the corresponding channel attention branch, the generated branch channel attention map can be expressed as:

A^C_i = σ(W_i · cat(X^gap_i, X^gmp_i))

where A^C_i is the branch channel attention map of X_i, σ denotes the sigmoid function, W_i ∈ R^(C×2C) denotes the parameters of the fully connected layer, cat(·) denotes the concatenation operation along the channel dimension, and X^gap_i and X^gmp_i denote the feature maps generated by the GAP and the GMP, respectively.
After obtaining the branch channel attention map A^C_i ∈ R^(C×1×1), it is expanded to Ã^C_i ∈ R^(C×(H/N)×W) by a broadcasting operation, and Ã^C_i is then embedded into X_i via element-wise multiplication to obtain a new feature map X^C_i:

X^C_i = Ã^C_i ⊗ X_i

where X^C_i is the output feature map of X_i and ⊗ denotes element-wise multiplication.
All new feature maps are concatenated to obtain the final output feature map of the PCAM, which can be expressed as:

X^C = cat(X^C_1, X^C_2, . . . , X^C_N)

where X^C ∈ R^(C×H×W) denotes the part-level channel attention output of X, cat(·) denotes the concatenation operation along the height dimension, and X^C_i ∈ R^(C×(H/N)×W).
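The PCAM computation above can be sketched in PyTorch as follows. The layer shapes (one FC of size C×2C per part, GAP and GMP per branch) follow the text; everything else, including class and variable names, is an assumption.

```python
import torch
import torch.nn as nn

class PCAM(nn.Module):
    """Sketch of the part-level channel attention module: each horizontal
    part gets its own channel-attention branch (GAP + GMP -> FC -> sigmoid),
    and the re-weighted parts are re-stacked along the height dimension."""

    def __init__(self, channels: int, num_parts: int = 8):
        super().__init__()
        self.num_parts = num_parts
        # One FC branch (W_i of shape C x 2C) per part.
        self.fcs = nn.ModuleList(
            [nn.Linear(2 * channels, channels) for _ in range(num_parts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        parts = torch.chunk(x, self.num_parts, dim=2)   # split along height
        outs = []
        for xi, fc in zip(parts, self.fcs):
            gap = xi.mean(dim=(2, 3))                   # GAP -> (B, C)
            gmp = xi.amax(dim=(2, 3))                   # GMP -> (B, C)
            a = torch.sigmoid(fc(torch.cat([gap, gmp], dim=1)))  # (B, C)
            outs.append(xi * a[:, :, None, None])       # broadcast multiply
        return torch.cat(outs, dim=2)                   # re-stack along height
```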

2) SPATIAL ATTENTION MODULE
The spatial attention mechanism works by converting the original image into a new feature space and creating a weight mask for all locations in an adaptive manner. The weight mask is employed to modulate the output, resulting in an augmented representation of the target region of interest while concurrently attenuating the feature representation of irrelevant background regions.
In this paper, we design a SAM, the overall structure of which is shown in Fig. 4. For a given feature map X ∈ R^(C×H×W), where C is the number of channels, H is the height, and W is the width, two spatial descriptors are generated by the cross-channel average pooling (CAP) and cross-channel max pooling (CMP) layers. These descriptors are concatenated to generate a new feature map, which is then processed by a convolutional layer and a sigmoid function to generate a more efficient descriptor F ∈ R^(1×H×W). The spatial attention map can be expressed as:

A^S = σ(ϕ(cat(X^cap, X^cmp)))

where A^S ∈ R^(1×H×W) denotes the spatial attention map generated by the SAM, σ(·) denotes the sigmoid function, ϕ(·) denotes a convolution operation with a kernel size of 3 × 3, cat(·) denotes the concatenation operation along the channel dimension, and X^cap and X^cmp denote the spatial descriptors obtained by CAP and CMP.
After obtaining the spatial attention map A^S ∈ R^(1×H×W), it is expanded to Ã^S ∈ R^(C×H×W) by a broadcasting operation, and Ã^S is then embedded into X via element-wise multiplication. The output feature map of the SAM can be expressed as:

X^S = Ã^S ⊗ X

where X^S ∈ R^(C×H×W) denotes the output feature map of the SAM, X denotes the input feature map, ⊗ denotes element-wise multiplication, and Ã^S denotes the expanded spatial attention map.
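A minimal PyTorch sketch of the SAM follows directly from the two equations above (cross-channel average and max pooling, a 3 × 3 convolution, and a sigmoid gate); the class name and padding choice are assumptions.

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    """Sketch of the spatial attention module: two single-channel spatial
    descriptors (cross-channel mean and max) are concatenated, convolved
    with a 3x3 kernel, and squashed to a (1, H, W) attention map that gates
    the input feature map."""

    def __init__(self):
        super().__init__()
        # 2 input channels (CAP + CMP descriptors), 1 output channel.
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        cap = x.mean(dim=1, keepdim=True)    # cross-channel average pooling
        cmp_ = x.amax(dim=1, keepdim=True)   # cross-channel max pooling
        a = torch.sigmoid(self.conv(torch.cat([cap, cmp_], dim=1)))
        return x * a                         # broadcast over channels
```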

3) PART-LEVEL HYBRID ATTENTION MODULE
The PHAM is constructed by placing the PCAM and the SAM in sequence [29], as shown in Fig. 2. Given an input feature map X ∈ R^(C×H×W), where C is the number of channels, H is the height, and W is the width, the feature generated after passing through the PHAM can be expressed as:

X^PHAM = Att_S(Att_C(X))

where X^PHAM ∈ R^(C×H×W) denotes the feature map generated by the PHAM, Att_S(·) denotes the SAM, and Att_C(·) denotes the PCAM.

C. NEIGHBORING-PART DEPENDENCY EXPLORATION MODULE
The long short-term memory (LSTM) network is a widely used variant of the traditional recurrent neural network developed by Hochreiter and Schmidhuber [30], capable of learning long-term dependencies in sequence modeling tasks. Unlike traditional RNNs, LSTM networks employ two modules to facilitate the learning of long-term dependencies, namely the memory module and the gate module. The gate module consists of an input gate, an output gate, and a forget gate. These gates enable the LSTM network to selectively add or remove information from the cell state and effectively capture long-term dependencies. The structure of the LSTM unit is illustrated in Fig. 5, and the relationship between the input and output is as follows:

f_t = σ(W_f · [h_(t−1), x_t] + b_f)
i_t = σ(W_i · [h_(t−1), x_t] + b_i)
C̃_t = tanh(W_C · [h_(t−1), x_t] + b_C)
C_t = f_t ⊗ C_(t−1) + i_t ⊗ C̃_t
o_t = σ(W_o · [h_(t−1), x_t] + b_o)
h_t = o_t ⊗ tanh(C_t)

where f_t, i_t, and o_t represent the forget gate, the input gate, and the output gate, respectively; h_t denotes the hidden state at step t, x_t denotes the input at step t, C_t denotes the cell state at step t, C̃_t is a vector of new candidate values, σ denotes the sigmoid function, and W and b indicate the corresponding weights and biases.

The spatial information in pedestrian images includes not only rich global and local features but also interdependencies between neighboring parts. To effectively explore and establish these dependencies, an LSTM-based neighboring-part dependency exploration module (NDEM) is designed, as illustrated in Fig. 6, to uncover meaningful features in the image. Consider a given feature map X ∈ R^(C×H×W), where C is the number of channels, H is the height, and W is the width.
The corresponding feature vector f^D ∈ R^(1×m) generated by the NDEM can be formulated as:

f^D = σ(W · cat(LSTM(φ(X))))

where σ(·) denotes the sigmoid function, W ∈ R^(m×2NR) denotes the parameters of the FC layer, cat(·) denotes the concatenation operation along the first dimension, LSTM(·) denotes the LSTM network with two hidden layers and hidden size R, and φ(·) denotes the feature-map processing operations, including reshaping, permutation, and flattening. Through these operations, we obtain features that incorporate rich spatial dependencies.
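One way the NDEM could be realized is sketched below. The text does not fully specify how φ flattens each part into a step vector, so here, as an assumption, each of the N horizontal parts is average-pooled into a C-dimensional step and a bidirectional two-layer LSTM of hidden size R scans the parts top to bottom; the concatenated outputs then have length 2NR, matching the FC shape W ∈ R^(m×2NR) given in the text.

```python
import torch
import torch.nn as nn

class NDEM(nn.Module):
    """Sketch of the neighboring-part dependency exploration module: the
    LSTM scans the sequence of part descriptors so each output step depends
    on its neighbors, and an FC layer maps the concatenated outputs to f^D."""

    def __init__(self, channels: int, num_parts: int, hidden: int, out_dim: int):
        super().__init__()
        self.num_parts = num_parts
        # Two hidden layers, hidden size R (= `hidden`), bidirectional so the
        # concatenated per-step outputs have width 2R.
        self.lstm = nn.LSTM(channels, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * num_parts * hidden, out_dim)  # m x 2NR

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        parts = torch.chunk(x, self.num_parts, dim=2)          # N parts
        seq = torch.stack([p.mean(dim=(2, 3)) for p in parts], dim=1)  # (B, N, C)
        out, _ = self.lstm(seq)                                # (B, N, 2R)
        return torch.sigmoid(self.fc(out.flatten(1)))          # (B, m)
```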

D. FINAL REPRESENTATION
As shown in Fig. 2, the feature vectors f^B and f^D are then fused to generate the final feature representation, which can be expressed as:

f = W · cat(f^B, f^D)   (10)

where W denotes the parameters of the FC layer, and cat(·) denotes the concatenation operation along the first dimension.

E. LOSS FUNCTION
To enhance feature distinctiveness, we utilize a joint loss framework to optimize the model parameters. The loss function comprises two primary components: cross-entropy loss and triplet loss [31].

1) CROSS-ENTROPY LOSS
The cross-entropy loss function is capable of learning inter-class information and is commonly employed in image classification tasks. Additionally, to enhance the model's generalization ability, we adopt a cross-entropy loss with label smoothing, which can be mathematically represented as:

L_ce = −(1/N_0) Σ_(n=1)^(N_0) Σ_(k=1)^(K) q_n^k log(p_n^k)   (11)

where N_0 represents the number of pedestrian images in a mini-batch, q_n^k represents the true label distribution of the k-th ID with label smoothing, which can be expressed as in Equation (12), and p_n^k represents the predicted probability distribution of the k-th ID.
q_n^k = 1 − ε + ε/K if k = y_n, and q_n^k = ε/K otherwise,   (12)

where K represents the total number of pedestrian IDs, y_n represents the ground-truth label of the n-th pedestrian image, and ε is the hyperparameter used for label smoothing.
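Equations (11) and (12) can be computed together as in the following NumPy sketch (the function name and the example values are ours; with ε = 0 the loss reduces to the ordinary cross-entropy):

```python
import numpy as np

def smoothed_ce(logits: np.ndarray, labels: np.ndarray, eps: float = 0.1) -> float:
    """Cross-entropy with label smoothing: the true class receives target
    probability 1 - eps + eps/K and every other class receives eps/K."""
    n, k = logits.shape
    # Numerically stable softmax -> predicted distribution p (Eq. 11).
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Smoothed target distribution q (Eq. 12).
    q = np.full((n, k), eps / k)
    q[np.arange(n), labels] = 1.0 - eps + eps / k
    # Average over the mini-batch, sum over classes.
    return float(-(q * np.log(p)).sum(axis=1).mean())
```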

2) TRIPLET LOSS
Triplet loss is a commonly used metric learning function that effectively minimizes the distance between positive sample pairs and maximizes the distance between negative sample pairs. To enhance the model's capability to distinguish pedestrians with similar appearances, we adopt a batch-hard triplet loss to optimize the model parameters, which can be mathematically expressed as:

L_tri = Σ_a [ m + max_p ‖f(x_a) − f(x_p)‖_2 − min_n ‖f(x_a) − f(x_n)‖_2 ]_+

where f(x_a), f(x_p), and f(x_n) denote the features of the anchor, positive, and negative samples, respectively; for each anchor, the hardest (farthest) positive and the hardest (closest) negative in the batch are selected. ‖·‖_2 represents the Euclidean distance between features, and m is the margin hyperparameter that controls the inter-class and intra-class distance.
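The batch-hard selection above can be sketched in NumPy as follows (the function name and default margin are assumptions; the sketch assumes each ID appears at least twice in the batch, as guaranteed by the P × K sampling described in Section IV):

```python
import numpy as np

def batch_hard_triplet(feats: np.ndarray, ids: np.ndarray, margin: float = 0.3) -> float:
    """Batch-hard triplet loss: for each anchor, take the farthest positive
    and the closest negative, then apply the hinge [m + d_ap - d_an]_+."""
    n = len(ids)
    # Pairwise Euclidean distance matrix.
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    same = ids[:, None] == ids[None, :]
    losses = []
    for a in range(n):
        pos = d[a][same[a] & (np.arange(n) != a)]   # same ID, not itself
        neg = d[a][~same[a]]                        # different ID
        losses.append(max(0.0, margin + pos.max() - neg.min()))
    return float(np.mean(losses))
```

When the classes are well separated (every negative farther than margin plus the hardest positive), the loss is exactly zero.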
The final loss can be expressed as:

L_total = α L_ce + β L_tri

where L_total denotes the final loss, and α and β denote the loss balance parameters.

IV. EXPERIMENT AND ANALYSIS
To comprehensively evaluate the performance of the proposed NDMF-Net, a series of experiments were conducted. Firstly, the performance of NDMF-Net was tested on three mainstream datasets, Market-1501 [32], DukeMTMC-ReID [2], and MSMT17 [33], and the experimental results were compared with some current state-of-the-art person Re-ID methods. Secondly, ablation experiments were conducted on the Market-1501 dataset to evaluate the impact of each module in NDMF-Net. Besides, we also analyzed the effect of different split numbers, N , on the performance of NDMF-Net on the Market-1501 dataset.

A. EXPERIMENT 1) DATASETS
The Market-1501 dataset, gathered at Tsinghua University, includes 32,668 images of 1,501 pedestrians, captured by 5 high-resolution cameras and 1 low-resolution camera. The DukeMTMC-ReID dataset, collected at Duke University and released in 2017, encompasses 36,411 images of 1,812 pedestrians, captured by 8 cameras. The MSMT17 dataset, released in 2018, is a multi-scenario and multi-time dataset and has 126,441 images of 4,101 pedestrians, captured by 12 outdoor cameras and 3 indoor cameras. Detailed information can be found in Table 1.

2) EVALUATION METRIC
In this paper, we employ the cumulative matching characteristics (CMC) at Rank-1 and the mean average precision (mAP) as evaluation metrics for the model, without using any re-ranking technique during the evaluation process. The CMC at Rank-1 can be expressed as:

Rank-1 = (1/m) Σ_(i=1)^(m) x_i

where m is the total number of probe images and x_i is an indicator variable: x_i = 1 if the i-th probe and the top-ranked image in the similarity ranking share the same ID, and x_i = 0 otherwise. The mAP can be expressed as:

mAP = (1/k) Σ_(i=1)^(k) AP_i

where k is the number of classes and AP_i is the average precision for class i, calculated by plotting the precision-recall curve and computing the area under it.
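Given a query-gallery distance matrix, the Rank-1 metric above amounts to checking whether each probe's nearest gallery image shares its ID. A minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def rank1(dist: np.ndarray, query_ids: np.ndarray, gallery_ids: np.ndarray) -> float:
    """CMC Rank-1: the fraction of probes whose closest gallery image
    (smallest distance) carries the same identity."""
    top1 = gallery_ids[np.argmin(dist, axis=1)]   # nearest gallery ID per probe
    return float((top1 == query_ids).mean())
```

In a full evaluation protocol, gallery images from the same camera as the probe are typically excluded before taking the argmin; that filtering step is omitted here for brevity.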

3) EXPERIMENTAL SETTINGS
During the model training process, we randomly select a batch consisting of P (P = 16) pedestrians, each with K (K = 4) images, and pre-process all images before feeding them into the network. First, we resize all images to 256 × 128 and pad them with 10 pixels of zeros on each side. The padded images are then randomly cropped back to 256 × 128 and normalized. To augment the samples, we apply random horizontal flipping and random erasing. The Adam optimizer is used to optimize the model parameters, with the weight decay factor set to 5e-4 and the momentum to 0.9, for a total of 200 epochs. A warm-up strategy is used to update the learning rate, where ep denotes the current epoch.
Beyond that, we set the split part number N to 8, and set the loss balance parameters α and β to 0.3 and 0.7, respectively.

B. EXPERIMENTAL RESULTS AND ANALYSIS
To evaluate the performance of the NDMF-Net, we conduct experiments on three mainstream datasets and compare the results with several state-of-the-art person Re-ID methods. Among these methods, [34], [35], [46] use a re-ranking approach to improve person Re-ID accuracy. Reference [36] investigates feature correlation by decomposing feature maps into multiple subspaces. To achieve a more generalized feature representation, [13], [41], [43], [44] propose multi-scale feature fusion mechanisms. References [7], [40], and [42] incorporate attribute information into the network to produce more discriminative feature representations. References [16], [19], [37], [38], and [39] improve person Re-ID accuracy by more effectively capturing local information within the image. Reference [45] proposes a lightweight network that achieves impressive performance. Tables 2 to 4 show the Rank-1 accuracy and mAP of our method and other state-of-the-art methods on the three datasets, namely Market-1501, DukeMTMC-ReID, and MSMT17. The experimental results presented in Table 2 indicate that our method achieves 95.8% Rank-1 accuracy and 88.9% mAP on the Market-1501 dataset, surpassing the second-best method by 0.7% in Rank-1 accuracy and 2.1% in mAP. Similarly, Table 3 presents the experimental results on the DukeMTMC-ReID dataset, where our method achieves 89.7% Rank-1 accuracy and 79.8% mAP. Compared with the second-best Rank-1 accuracy and mAP, achieved by OSNet [43] and RANGEv2 [45], respectively, our method outperforms them by 1.0% in Rank-1 accuracy and 1.6% in mAP. Furthermore, Table 4 shows the experimental results on the MSMT17 dataset, where our method achieves 78.4% Rank-1 accuracy and 54.5% mAP. CDNet [44] achieves the best Rank-1 accuracy and mAP, outperforming our method by 0.5% and 0.2%, respectively.
Nonetheless, our method still outperforms other state-of-the-art methods and achieves the second-best performance.
Compared with other state-of-the-art methods, the NDMF-Net focuses on salient regions of all parts of pedestrian images, avoiding the loss of detailed information caused by focusing only on globally salient regions and enriching the fine-grained features. Additionally, by exploring the dependencies between neighboring parts of pedestrian images, the semantic continuity of features is improved. Thanks to the above two aspects, the NDMF-Net achieves the best or second-best performance on the three mainstream datasets, Market-1501, DukeMTMC-ReID, and MSMT17, which demonstrates the effectiveness of our method for person Re-ID.

C. ABLATION STUDY AND HYPERPARAMETERS ANALYSIS 1) ABLATION STUDY
To verify the effectiveness of each component of the NDMF-Net, we conduct ablation experiments on the Market-1501 dataset. Specifically, we first use the backbone network as the baseline, and then add different components to the baseline one by one. Finally, the improvements in model performance are used to demonstrate the effect of each component. The results of the ablation experiments are shown in Table 5.
As can be seen from Table 5, the baseline achieves 92.1% Rank-1 accuracy and 84.6% mAP on the Market-1501 dataset without any additional components. The PCAM improves the model's Rank-1 accuracy and mAP to 94.2% and 87.1%, respectively, outperforming the baseline by 2.1% and 2.5%. Similarly, the SAM improves the model's Rank-1 accuracy and mAP to 93.9% and 86.9%, respectively, outperforming the baseline by 1.8% and 2.3%. The NDEM improves the model's Rank-1 accuracy and mAP to 94.8% and 87.6%, respectively, outperforming the baseline by 2.7% and 3.0%. By jointly utilizing the PCAM, the SAM, and the NDEM, the model achieves 95.8% Rank-1 accuracy and 88.9% mAP, outperforming the baseline by 3.7% and 4.3%, respectively. These results demonstrate that the proposed components, the PCAM, the SAM, and the NDEM, effectively help the model extract richer discriminative features and ultimately improve its performance, confirming the effectiveness of the proposed NDMF-Net.

2) HYPERPARAMETERS ANALYSIS
We conduct experiments on the Market-1501 dataset to investigate the effect of the number of parts N on model performance, and the results are shown in Fig. 7. The experimental results show that as N gradually increases from small to large, the model performance improves. When N is 8, the NDMF-Net performs the best, achieving 95.8% Rank-1 accuracy and 88.9% mAP. As N continues to increase, model performance decreases.

D. FEATURE MAPS VISUALIZATION
To visually demonstrate the effectiveness of the NDMF-Net, we extract feature maps from pedestrian images in the Market-1501 dataset and visualize them. Fig. 8 shows the results: Fig. 8(a) shows the original pedestrian images, Fig. 8(b) the feature maps extracted by the baseline, and Fig. 8(c) the feature maps extracted by the NDMF-Net. The network's attention is represented by color, with deep red indicating the most focused areas and deep blue the least focused. As Fig. 8 illustrates, the baseline emphasizes only the most salient regions of the image and disregards other informative details, leading to inadequate feature extraction. In contrast, the NDMF-Net generates salient features not only at the whole-image level but also at the part level, using the PHAM to capture fine details. Additionally, the NDEM directs the network to extract spatial dependencies between neighboring parts of pedestrians, enhancing the semantic coherence of the extracted features. Consequently, the features extracted by our method are more comprehensive and representative.
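Heatmaps such as those in Fig. 8 are typically produced by collapsing the channel axis of a feature map and normalizing the result before colorizing it. The paper does not restate its visualization pipeline, so the following is a generic NumPy sketch under that common recipe.

```python
import numpy as np

def activation_heatmap(feature_map):
    """Collapse a (C, H, W) feature map to an (H, W) map normalized to [0, 1],
    ready for colorizing (e.g. deep blue = low attention, deep red = high)."""
    heat = feature_map.mean(axis=0)   # average activation per spatial location
    heat = heat - heat.min()          # shift so the minimum is 0
    peak = heat.max()
    return heat / peak if peak > 0 else heat

fmap = np.random.rand(2048, 24, 8)
heat = activation_heatmap(fmap)
print(heat.shape)                     # (24, 8)
```

The normalized map would then be upsampled to the input resolution and overlaid on the pedestrian image with a jet-style colormap.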

E. VISUALIZATION OF RETRIEVAL RESULTS
We randomly select several images from the Market-1501 query set and present their retrieval results in Fig. 9. A green rectangle indicates that the retrieved image's ID matches the probe, while a red rectangle indicates a mismatch. The retrieval results demonstrate that our method retrieves the correct images in most cases.
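Retrieval itself follows the standard Re-ID protocol: gallery features are ranked by their similarity to the query feature, and the top-ranked images are the ones displayed in Fig. 9. A minimal sketch using cosine similarity (the paper's exact distance metric is not restated here, so treat this as illustrative):

```python
import numpy as np

def rank_gallery(query, gallery):
    """Return gallery indices sorted by cosine similarity to the query,
    most similar first."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity of each gallery feature
    return np.argsort(-sims)          # descending similarity

query = np.array([1.0, 0.0])
gallery = np.array([[0.0, 1.0],       # orthogonal to the query
                    [0.9, 0.1],       # close to the query
                    [1.0, 0.0]])      # identical direction
print(rank_gallery(query, gallery))   # [2 1 0]
```

An image at rank k is marked green when its ID equals the probe's and red otherwise, which is exactly how Fig. 9 is annotated.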

F. FAILURE ANALYSIS
While our method generally produces correct identification results, we observe that for pedestrian images with distinct IDs, the retrieval results occasionally mistake them for the same person, especially when both the background and the pedestrians themselves are highly alike. The cause is that the dependency relationships among the body parts of highly similar pedestrians are themselves extremely similar, leading to inaccurate identification. This failure mode underlines the need for further investigation into extracting more distinctive local features and dependency patterns, particularly when dealing with highly similar individuals.

V. CONCLUSION
In this paper, we propose a novel NDMF-Net for person Re-ID, which consists of a backbone network, two PHAMs, and two NDEMs. The PHAM is composed of a PCAM and a SAM, which enables the network to focus on important features of different parts in pedestrian images, thus obtaining more discriminative fine-grained features. The NDEM can guide the network to extract dependencies between neighboring parts in pedestrian images, improving the semantic consistency of pedestrian features. Finally, by fusing different features, the final pedestrian representation is obtained. Thanks to these modules, the NDMF-Net can learn more comprehensive and discriminative features and ultimately improve the performance of person Re-ID. Experimental results have shown that our method can achieve better performance in person Re-ID tasks and outperform most existing methods.
Different parts of pedestrian images contain rich detailed information and have potential relationships. In the future, we will continue to investigate the information correlation among different parts and apply it to unsupervised person Re-ID tasks.

CHUAN ZHU received the B.Eng. degree from Harbin University, in 2015. He is currently pursuing the Ph.D. degree with Fudan University. His research interests include machine learning, computer vision, and natural language processing.

WENJUN ZHOU received the B.Eng. degree in mechanical engineering from Donghua University, in 2018. She is currently pursuing the Ph.D. degree with Fudan University. Her research interests include machine learning and speaker recognition.
YINGJUN ZHU received the B.Eng. and B.BM. degrees, in 2020. He is currently pursuing the M.Sc. degree in general mechanics and mechanics with Fudan University. His research interests include speech emotion recognition and multimodal fusion.
JIANMIN MA received the Ph.D. degree from Xi'an Jiaotong University, in 1998. He is currently a Professor with the Department of Aeronautics and Astronautics, Fudan University. His research interests include mechanical vibration and artificial intelligence.