Focusing on shared areas for partial person re-identification

ABSTRACT Person re-identification (Re-ID) achieves ideal performance only when the sampled image is complete. In real-world scenarios, however, the whole body often cannot be detected because pedestrians are occluded or stand at the edge of the surveillance range, so the image contains only part of the pedestrian's visible information. When a standard person re-identification method is used to match a partial image with a complete one, spatial misalignment and interference from the missing areas arise. We therefore propose a focused shared area model (FSA) for partial re-identification to address these problems. We use self-supervised learning to locate the shared area and learn region-level features. In addition, we adopt a self-attention mechanism to help the network attend to the important features of the image, thus reducing the influence of background information. Finally, we verify the effectiveness of our method through experiments on two mainstream holistic datasets, Market-1501 and DukeMTMC-reID, and two important partial datasets, Partial-REID and Partial-iLIDS.


Introduction
Person re-identification (Re-ID) can be understood, at a simple level, as image retrieval: it refers to retrieving the same pedestrian across different cameras (Zheng, Yang, and Hauptmann 2016). In recent years, the widespread deployment of surveillance cameras and the growing demand for safety have given person re-identification great practical significance, and it has attracted more and more researchers (Wang et al. 2018b; Yu and Zheng 2020).
However, because pedestrians are often partially occluded and the shooting range of a camera is limited, some captured images contain only part of the pedestrian's body, as shown in Figure 1. Applying standard person re-identification methods (Li, Zhu, and Gong 2018; Zheng et al. 2018; Lin et al. 2019; Liu, Chang, and Shen 2020) to partial person re-identification does not give satisfactory results. Previous studies on partial person re-identification mainly focused on effectively matching the whole pedestrian image with the partial pedestrian image (Sun et al. 2019; He et al. 2018b). Existing approaches include: 1) Sliding window matching (SWM), which uses the partial pedestrian image from the query set as a sliding window that slides over the whole pedestrian image to locate the most similar area (Zheng et al. 2015b); because it must traverse the whole image, its search efficiency is low. 2) Deep spatial reconstruction (DSR), which can match feature maps without being restricted by different image sizes (He et al. 2018a); because it requires one-to-one matching, it does not make full use of the GPU's tensor computing power. 3) The partial matching net (PMN), which uses key points of the human skeleton to align body regions (Iodice and Mikolajczyk 2018) but requires expensive additional clues and annotations. Overall, the two main problems faced by partial person re-identification are spatial misalignment and interference from non-shared areas.
To address the above problems, we propose a focused shared area model (FSA). First, we segment the complete image uniformly and use self-supervised learning to predefine the label of each region. Then, to perceive the shared area, we classify each pixel based on the generated labels: if a pixel belongs to a region, it receives a high probability for that region; otherwise, the probability is very small. Each region corresponds to a probability map, and all the values in the probability map are summed to obtain the visibility score of the region. In the feature extraction part, we use an attention mechanism to learn more important features, and then apply weighted pooling to the learned features using the probability maps and visibility scores to obtain the feature of each region. A region with a very small score makes almost no contribution to the subsequent similarity calculation. In the testing phase, we first calculate the Euclidean distance for each region separately, and then combine the distances of all regions to obtain the overall distance. For training, we apply the two loss functions commonly used in person re-identification tasks: cross-entropy loss and triplet loss.
In summary, the contributions of our work are as follows:
(1) To solve the two problems of spatial misalignment and interference from non-shared areas in partial re-identification, we propose a focused shared area model that can perceive the shared areas between the complete image and the partial image.
(2) Because pedestrian images have cluttered backgrounds and widely varying scenes, we adopt a channel attention mechanism to reduce the impact of the background, letting the network pay more attention to the pedestrian foreground area and collect representative and discriminative features.
(3) The effectiveness of our method is verified on two mainstream holistic datasets and two partial datasets.

Person re-identification based on local features
Local features have also been proven helpful for person re-identification: combined with global features, they can improve the accuracy of Re-ID (Zhao et al. 2017b; Kumar et al. 2017; Li et al. 2017; Qian et al. 2018). In previous work, manual segmentation is a very common way to extract local features (Wang et al. 2018a; Zheng et al. 2019a). Owing to the structure of the human body, images are often divided into several parts along the vertical direction (head, upper body, lower body, etc.), and all local features are finally integrated into a single representation. For instance, the PCB framework evenly divides the pedestrian feature map into six pieces and then trains a separate loss on each of the six pieces to predict the ID. However, this kind of method has the disadvantage of requiring relatively accurate spatial alignment of the images.
Some methods (Sarfraz et al. 2018; Zhao et al. 2017a; Song et al. 2019; Zhu et al. 2020) adopt body part detectors or human parsing models instead of bounding boxes to accurately locate the arbitrary contours of various body parts. For instance, SPReID (Kalayeh et al. 2018) proposed a parsing model with pixel-level accuracy that generates five probability maps (foreground, head, upper body, lower body, and shoes) for different predefined human body regions in order to compute more reliable part representations, and achieved excellent results on different person re-identification benchmarks.
Another common practice is to align the local features of pedestrians using prior knowledge, mainly pretrained human pose estimation models and skeleton key point models (Ge et al. 2018; Liu et al. 2018; Miao et al. 2019; Miao, Wu, and Yang 2021; Zhu et al. 2019). Poses are easier to label than human parsing maps, and there are many different datasets that generalize easily. Wei et al. (2017) divide the pedestrian image into three parts, namely the head, upper body, and lower body, using the extracted key points of the human body, and then merge the global and local features. Spindle Net (Zhao et al. 2017a) generates seven body regions by locating 14 key points, and then extracts the features of the different regions and merges them hierarchically. Unlike Spindle Net, Zheng et al. (2019b) first estimate the key points of the pedestrian using a pose estimation model and then use an affine transformation to align the same key points. However, the potential gap between the datasets used for pose estimation and the person re-identification datasets remains a problem.

Partial person re-identification
Holistic person re-identification based on deep learning has made significant progress and produced many research results. However, algorithms designed for holistic person re-identification are no longer applicable to partial person re-identification. In the partial Re-ID setting, the query image is partial, while the gallery image is complete. If the partial image is matched directly with the holistic image, spatial misalignment and interference from non-shared areas result. To address partial person re-identification, a data augmentation method based on compound batch erasure has been proposed to simulate occlusions (Yan et al. 2021).
The problem of partial person re-identification was proposed by Zheng et al. (2015b), who constructed the query image as a sliding window to traverse the gallery image and search for similar areas. To solve spatial misalignment, He et al. (2018a) proposed DSR, which reconstructs deep spatial features. Later, they proposed SFR (He et al. 2018b), an improved method based on DSR that uses multi-scale pyramid pooling to make the features more applicable to multi-scale images, and achieved better results.
All the above methods address partial person re-identification by directly matching images. Semantic segmentation and pose estimation are also commonly used in partial person re-identification. For example, PMN (Iodice and Mikolajczyk 2018) uses a pre-trained human pose estimation model to extract skeleton key points and calculates the similarity of the shared area based on the shared key points. This type of method can match the shared region of two images more accurately, but requires additional clues and conditions to assist the matching process. To some extent, the performance of such methods depends on the stability and reliability of the prior model, and the cost is relatively high.

Attention mechanism
Attention mechanisms have been commonly used in natural language processing and computer vision (Ji et al. 2020; Li, Zhu, and Gong 2018; Si et al. 2018; Song et al. 2018), and have achieved great success in the former. In computer vision, the attention mechanism is used to further process visual information. It helps the model assign different weights to each part of the input and extract the more critical information, enabling the model to make more accurate judgments without incurring much greater computation and storage costs. This is also why attention mechanisms are so popular. SENet (Hu, Shen, and Sun 2018) introduced a channel attention mechanism that models the importance of each channel. Inspired by channel attention and spatial attention, the fused attention module CBAM (Woo et al. 2018) was proposed to improve the attention mechanism. Liu et al. (2017) and Xu et al. (2018) embedded the attention mechanism (Xu et al. 2015) into the network and let the model decide where to focus its attention. A method based on pose-guided spatial attention (PGSA) and activation-based attention (AA) has also been proposed, which can effectively suppress the occluded region and enhance the significance of the visible region (Xu, Zhao, and Qin 2021).

Methodology
Because a traditional convolutional neural network requires input samples of the same size, and partial person re-identification datasets cannot meet this requirement, we remove the fully connected layer from the convolutional neural network. On this basis, we construct the proposed focused shared area model (FSA), which mainly includes region pre-definition, foreground-aware region feature extraction, similarity measurement, and the training strategy, as shown in Figure 2. In Section 3.1, the whole image is divided into a fixed number of regions by uniform segmentation, and pseudo labels are assigned to each region by self-supervised learning. Section 3.2 introduces the localization of visible regions and the feature extraction of the foreground region. The similarity measurement is introduced in Section 3.3, and our training strategy is introduced in Section 3.4.

Region pre-definition by self-supervised learning
Inspired by PCB, we treat a region as a part as long as it is stable enough. We divide the whole image horizontally into a fixed number of regions and assign a pseudo label to each pre-defined region through self-supervised learning (Noroozi and Favaro 2016; Wang, He, and Gupta 2017). The pseudo labels are used as the classification supervision signal, so that the network can easily distinguish the visible regions. Self-supervised learning is a special kind of unsupervised learning that automatically generates supervisory signals for feature learning by exploring visual information. Specifically, each region of the input image is projected to the corresponding position on the feature map X through ROI projection. Assuming that the upper-left and lower-right corners of the region are located at $(u_1, v_1)$ and $(u_2, v_2)$, respectively, the corresponding positions on the feature map X are $(u_1/C, v_1/C)$ and $(u_2/C, v_2/C)$, respectively, where C is the down-sampling rate. Then, each pixel t in X is assigned a pseudo label L that represents the region t belongs to, and Z denotes the set of visible regions. Self-supervised learning plays a crucial role in this paper: it not only assigns labels to each pre-defined region, but also makes the model focus on the visible region and the shared visible region in the subsequent training with the classification loss and the triplet loss.
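The region pre-definition step can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the function names `project_box` and `row_pseudo_labels`, the use of integer division when projecting coordinates, and the choice of labeling each feature-map row by the stripe that contains its center are all assumptions made for illustration. The sketch assumes a holistic image of known height that is split into K equal horizontal stripes, and a partial image obtained by keeping a vertical range of rows.

```python
def project_box(u1, v1, u2, v2, stride):
    """Project image-space corners onto the feature map by dividing by the
    down-sampling rate C (here `stride`). Integer division is an assumption."""
    return (u1 // stride, v1 // stride, u2 // stride, v2 // stride)


def row_pseudo_labels(full_height, crop_top, crop_bottom, num_regions, stride):
    """Assign a region pseudo label to every feature-map row of a vertical crop.

    The holistic image (height `full_height`) is split into `num_regions`
    equal horizontal stripes. The crop keeps image rows [crop_top, crop_bottom).
    Returns the per-row labels and the sorted list of visible regions (Z).
    """
    labels = []
    num_rows = (crop_bottom - crop_top) // stride
    for r in range(num_rows):
        # Center of this feature-map row, in holistic image coordinates.
        center = crop_top + r * stride + stride // 2
        label = min(center * num_regions // full_height, num_regions - 1)
        labels.append(label)
    visible = sorted(set(labels))
    return labels, visible
```

For a 384-pixel-high image with six regions and a down-sampling rate of 16, keeping the top 256 rows yields 16 feature-map rows labeled with regions 0 to 3, so the two lowest regions are treated as invisible.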

Foreground-aware region feature representation
With the aim of eliminating the interference from the background and non-shared regions and the spatial misalignment of pedestrian images, we employ a channel attention mechanism to focus on the pedestrian foreground. First, we use ResNet101 as the backbone to extract the feature map T from the input image I, and then obtain the foreground feature map X through foreground perception. Second, we use a 1 × 1 convolution layer and a softmax function to predict, for each pixel t on the feature map X, the region it belongs to:
$$P(M_i \mid t) = \frac{\exp(W_i^{\top} t)}{\sum_{j=1}^{K} \exp(W_j^{\top} t)}, \quad i = 1, \dots, K \tag{1}$$
where $P(M_i \mid t)$ is the predicted probability that t belongs to region $M_i$, W is the weight matrix of the 1 × 1 convolution layer with $W_i$ its i-th filter, and K is the number of pre-defined regions. By sliding over each pixel t on the feature map X, the probability of t belonging to each pre-defined region is predicted and K probability maps are obtained, as shown in Figure 2. The visibility score $S_i$ of each region is predicted by accumulating all the values on the corresponding probability map, as shown in Equation (2):
$$S_i = \sum_{t \in X} P(M_i \mid t) \tag{2}$$
Figure 2. The framework of our proposed FSA. Firstly, the holistic image is pre-defined into a fixed number of regions, and the feature map T is generated through the backbone network. Secondly, the foreground feature map X is generated by the channel attention mechanism, each region is sensed through a 1 × 1 convolution layer and a softmax function, and the distribution probability map and corresponding visibility score of each region are output. Finally, the feature map X is weighted with the distribution probability maps to generate a fixed number of regional features.
If a significant number of pixels belong to a region, we consider it likely to be visible in the input image, and its visibility score will be relatively high. On the other hand, if a region is actually invisible, all the values on the corresponding probability map will be approximately 0, and the visibility score will also be small. The feature map is weighted by the predicted probability map of each region and pooled to generate the corresponding fine-grained feature $f_i$ of each region, as in Equation (3):
$$f_i = \frac{\sum_{t \in X} P(M_i \mid t)\, t}{S_i} \tag{3}$$
where $S_i$ is the visibility score from Equation (2). The number of generated region features is always consistent with the number of predefined regions: even if some regions are actually missing, the corresponding region features are still output, just as the probability maps are generated. However, the features of missing regions are not relied upon in the subsequent similarity measurement, as will be explained in Section 3.3.
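The region perception and weighted pooling described above can be sketched in plain Python as follows. This is an illustrative sketch rather than the paper's code: the function name `region_features`, the representation of the feature map as a flat list of pixel vectors, and the normalization of each pooled feature by its visibility score are assumptions. The 1 × 1 convolution is modeled as one weight vector per region.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def region_features(pixels, weights):
    """Predict region probabilities, visibility scores and region features.

    pixels  : list of feature vectors, one per pixel t of the feature map X
    weights : K weight vectors, standing in for the 1 x 1 convolution
    Returns (probs, scores, feats), where probs[n][i] is P(M_i | t_n),
    scores[i] is the visibility score S_i and feats[i] is the pooled
    feature of region i.
    """
    num_regions = len(weights)
    dim = len(pixels[0])
    probs = [
        softmax([sum(w * x for w, x in zip(w_i, t)) for w_i in weights])
        for t in pixels
    ]
    # Visibility score: sum of one probability map over all pixels.
    scores = [sum(p[i] for p in probs) for i in range(num_regions)]
    feats = []
    for i in range(num_regions):
        pooled = [sum(p[i] * t[d] for p, t in zip(probs, pixels)) for d in range(dim)]
        # Normalizing by S_i is an assumption; the guard avoids division by zero.
        feats.append([v / max(scores[i], 1e-12) for v in pooled])
    return probs, scores, feats
```

Because the softmax is taken over regions at every pixel, the visibility scores of all regions sum to the number of pixels, and a region that few pixels belong to receives a small score while still producing a feature vector.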

Similarity measurement
We measure the similarity by matching the common visible regions of the query image and the gallery image and calculating the Euclidean distance between them. The similarity measurement is illustrated in Figure 3.
The final similarity between the query image and the gallery image is evaluated by the overall distance $D_{qg}$, which is computed from the region-level distances under the guidance of the visibility scores.
If a region is not visible, it makes little contribution to the overall distance. Therefore, the overall distance between the images is dominated by the common regions.
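One way to realize such a visibility-guided distance is sketched below. The exact weighting is an assumption: here each region-level Euclidean distance is weighted by the product of the visibility scores of that region in the query and gallery images, and the sum is normalized by the total weight. The function name `overall_distance` is hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def overall_distance(q_feats, q_scores, g_feats, g_scores):
    """Visibility-weighted distance between a query and a gallery image.

    q_feats / g_feats   : per-region feature vectors
    q_scores / g_scores : per-region visibility scores
    The product weighting and the normalization are assumptions.
    """
    weights = [sq * sg for sq, sg in zip(q_scores, g_scores)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("no region is visible in both images")
    dists = [euclidean(fq, fg) for fq, fg in zip(q_feats, g_feats)]
    return sum(w * d for w, d in zip(weights, dists)) / total
```

With this weighting, a region whose visibility score is close to zero in either image contributes almost nothing, so the result is dominated by the regions visible in both images.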

Training strategy
We adopt a training strategy that uses both cross-entropy loss and triplet loss to train the network jointly. The cross-entropy loss is used to calculate the classification loss of each visible region feature and of each pixel, denoted $L_{id}$ and $L_{pixel}$, respectively. Here CE denotes the cross-entropy loss, l is the predicted ID of the input image, y is the ground-truth ID, Z is the set of all visible regions, and L is the region pseudo label. In addition, to better distinguish inputs that are similar but belong to different identities, we adopt the triplet loss. Assume that the three input images are $I_a$, $I_p$, and $I_n$, where $I_a$ and $I_p$ form a positive sample pair and $I_a$ and $I_n$ form a negative sample pair. It is worth noting that the triplet loss is defined on the distance of the common region between two images, as in formula (8), where $D_{ap}$ is the distance of the common region between the positive sample pair, $D_{an}$ is the distance of the common region between the negative sample pair, Z is the set of visible regions, and $D_i$ is the distance of each region. After learning with the triplet loss, the distance between positive sample pairs becomes smaller, while the distance between negative sample pairs becomes larger. The calculation of formula (9) and formula (10) is similar to that of formula (5). The difference is that the distance calculation in the triplet loss is guided by labels during training, whereas the similarity measurement is guided by visibility scores during testing.
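A label-guided triplet loss over common regions can be sketched as follows. This is a hedged illustration only: averaging the region distances over the common visible set, the hinge form with a margin, and the default margin of 0.3 are assumptions, and the names `common_region_distance` and `region_triplet_loss` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def common_region_distance(feats_a, visible_a, feats_b, visible_b):
    """Mean region distance over the regions visible in both images.

    visible_a / visible_b are the visible-region sets given by the pseudo
    labels during training. Averaging (rather than summing) is an assumption.
    """
    common = sorted(set(visible_a) & set(visible_b))
    if not common:
        raise ValueError("the two images share no visible region")
    return sum(euclidean(feats_a[i], feats_b[i]) for i in common) / len(common)


def region_triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss on common-region distances.

    Each argument is a (features, visible_regions) pair.
    The margin value is a placeholder, not taken from the paper.
    """
    d_ap = common_region_distance(anchor[0], anchor[1], positive[0], positive[1])
    d_an = common_region_distance(anchor[0], anchor[1], negative[0], negative[1])
    return max(d_ap - d_an + margin, 0.0)
```

Restricting the distance to regions that the labels mark as visible in both images keeps missing regions from influencing the gradient during training.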
The final loss function combines the cross-entropy losses $L_{id}$ and $L_{pixel}$ with the triplet loss.

Datasets and metrics
In this paper, we verify the effectiveness of our proposed method on two large-scale public holistic datasets, Market1501 and DukeMTMC-reID, and two important partial datasets, Partial-REID and Partial-iLIDS. We employ two kinds of evaluation metrics: the cumulative matching characteristics (CMC) and the mean average precision (mAP). In our experiments, we use only a single query for image retrieval.
Market1501: The Market1501 dataset (L. Zheng et al. 2015a) was collected by six cameras on the campus of Tsinghua University. The dataset contains a total of 32,668 images and provides a training set and a test set. There are 751 people in the training set and 750 people in the test set. The training set contains 12,936 images and the test set contains 19,732 images. The images were automatically detected and cropped by a detector.
DukeMTMC-reID: The DukeMTMC-reID dataset (Z. Zheng, Zheng, and Yang 2017) comes from 8 different cameras at Duke University. The bounding boxes of the pedestrian images are manually labeled. The dataset consists of a training set and a test set. The training set contains 16,522 images, and the test set contains 17,661 images. There are a total of 702 people in the training data. The dataset also provides annotations for pedestrian attributes (gender, backpack, etc.).
Partial-REID: The Partial-REID dataset (W. Zheng et al. 2015b) was collected at Sun Yat-Sen University. It includes 300 pedestrian images of 30 pedestrians in total. On average, each pedestrian has five manually cropped partial images with occlusions and five full-body images. Because the occluded parts differ, the visible body areas obtained also differ. All partial images are used as query images, and the full-body images are used as gallery images. Since the dataset is too small, there is no training subset.
Partial-iLIDS: The Partial-iLIDS dataset (W. Zheng, Gong, and Xiang 2011) was collected at airports and covers a total of 119 pedestrians and 238 images. Each pedestrian has one partial image and one whole image. A special feature of this dataset is that most of the occlusions are suitcases carried by pedestrians, so the visible area is mainly the upper body. As with the Partial-REID dataset, all partial images are used as queries, and all whole images are used as the gallery. In the experiments, generally only the CMC metric is used to evaluate the two partial person re-identification datasets. To reduce the random error of the experiments, the results are averaged over 10 tests.
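The two evaluation metrics named in this section, CMC (reported as Rank-k accuracy) and mAP, can be computed from ranked gallery lists as in the following sketch. It is a simplified illustration: each query is represented only by a list of booleans marking which ranked gallery images share its identity, and the filtering of same-camera or junk images used by the standard protocols is omitted. The function names are hypothetical.

```python
def rank_k_hit(ranked_matches, k):
    """True if a correct gallery image appears among the top-k results."""
    return any(ranked_matches[:k])


def average_precision(ranked_matches):
    """Average precision of one ranked list of booleans (True = same ID)."""
    hits = 0
    precisions = []
    for position, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / hits if hits else 0.0


def evaluate(all_ranked_matches, ks=(1, 5, 10)):
    """Return ({k: Rank-k accuracy}, mAP) over all queries (single-query setting)."""
    n = len(all_ranked_matches)
    cmc = {k: sum(rank_k_hit(m, k) for m in all_ranked_matches) / n for k in ks}
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / n
    return cmc, mean_ap
```

For two queries whose ranked lists are `[True, False, True]` and `[False, True, False]`, Rank-1 is 0.5, Rank-5 is 1.0, and mAP is 2/3.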

Experiment settings
All of our implementations are based on the deep learning framework PyTorch. We build the focused shared area model on ResNet101 (He et al. 2016) pre-trained on ImageNet. Because the partial pedestrian datasets are too small for network training, we randomly crop images from the holistic pedestrian datasets Market1501 and DukeMTMC-reID according to a proportion R to obtain partial pedestrian images for training. The value of R ranges from 0.6 to 1.0. In our experiments, we set the batch size of each iteration to 64 and use stochastic gradient descent (SGD) to optimize the model. The base learning rate is set to 0.1, decays to 0.01 after 30 epochs, and training ends at 140 epochs. We set the momentum and weight decay factors to 0.9 and 0.0005, respectively. We use two loss functions, cross-entropy loss and triplet loss, to train the network. The evaluation indicators we use are Rank-1, Rank-5, Rank-10, and mAP. All experiments are performed on an NVIDIA TITAN X GPU.
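The learning-rate schedule and the random cropping used to simulate partial images can be sketched in plain Python as follows. The schedule follows the values stated above. The cropping function is an assumption: the text specifies only the ratio range, so this sketch crops along the vertical axis and samples both the ratio and the offset uniformly. The names `learning_rate` and `sample_vertical_crop` are hypothetical.

```python
import random


def learning_rate(epoch, base_lr=0.1, decay_epoch=30, decay_factor=0.1):
    """Step schedule: 0.1 for the first 30 epochs, then 0.01 until training ends."""
    return base_lr if epoch < decay_epoch else base_lr * decay_factor


def sample_vertical_crop(height, rng, min_ratio=0.6, max_ratio=1.0):
    """Sample a crop that keeps a fraction R of the image height.

    Cropping along the vertical axis with a uniformly sampled offset is an
    assumption made for this sketch. Returns (top, bottom, ratio), where
    image rows [top, bottom) are kept.
    """
    ratio = rng.uniform(min_ratio, max_ratio)
    crop_height = max(1, round(height * ratio))
    top = rng.randint(0, height - crop_height)
    return top, top + crop_height, ratio
```

Passing an explicit `random.Random` instance, for example `sample_vertical_crop(384, random.Random(0))`, keeps the sampled crops reproducible across runs.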

Comparison with state-of-the-art
To train the proposed FSA model, the whole image is cropped with different proportions R to obtain partial pedestrian images as input samples, where R takes a value within the interval [0.6, 1]. When R = 1, the image has not been cropped and the sample is complete. When R = 0.9, the cropped image accounts for 90% of the original image, and so on, so that different degrees of partial person re-identification can be simulated. The smaller R is, the smaller the partial pedestrian image is, and the more difficult partial person re-identification becomes.
Market1501 for holistic person Re-ID: Table 1 compares the proposed method with previous methods on the Market1501 dataset for holistic person Re-ID. There are three comparison methods: a baseline model for holistic person Re-ID, the PCB model based on local features for holistic person Re-ID, and the more advanced VPM (Sun et al. 2019) for partial Re-ID. Our method achieves mAP = 83.5% and Rank-1 = 94.8%, while PCB achieves mAP = 83.0% and Rank-1 = 94.4%. Overall, their performance is similar, indicating that our method is also applicable to holistic person re-identification.
Market1501 for partial person Re-ID: Under different cropping ratios, the sample images are partially missing, which can be regarded as partial person Re-ID. The smaller the cropping ratio is, the more of the image is cropped away and the less information it contains. As shown in Table 2, when more of the input sample is cropped, the mAP and Rank-1 of our method also decrease. However, compared with the other three methods at the same R, our method improves the results for most R values. It is worth noting that when the cropping ratio is 0.6, our method achieves 65.1% mAP and 80.1% Rank-1 accuracy: compared with the competitive method VPM, it is better on mAP but worse on rank accuracy. This may be caused by the distribution of the Market1501 dataset. The good results in the other cases indicate that the proposed method can alleviate the challenges of the partial person re-identification task.
Result on DukeMTMC-reID: As with Market1501, the similarly large DukeMTMC-reID dataset is cropped to different degrees, with the cropping ratio again taking values within the interval [0.6, 1]. For holistic person re-identification, Table 3 shows that our method achieves 74.6% mAP and 86.1% Rank-1, which is also an improvement over the PCB model. For the partial person re-identification task, the robustness of the method was tested with different degrees of region deletion. The experimental results are shown in Table 4. Compared with VPM, the proposed method improves mAP and Rank-1 in most cases.
Partial-REID & Partial-iLIDS: Tables 5 and 6 compare our method with existing partial person re-identification methods, including MTRC (Liao et al. 2013), AMC+SWM (Zheng et al. 2015b), DSR (He et al. 2018a), SFR (He et al. 2018b), and VPM (Sun et al. 2019). Our method achieves Rank-1 = 73.7% and Rank-3 = 82.7% on the Partial-REID dataset. The results on Partial-iLIDS are Rank-1 = 68.9% and Rank-3 = 82.4%, which shows that our method improves the performance of partial person re-identification.

Ablation analysis
We adopt uniform image segmentation to pre-define the regions of the whole person image. To explore the influence of the number of regions on the accuracy of the final re-identification task, a series of ablation experiments were conducted on the Market1501 and DukeMTMC-reID datasets. Table 7 and Table 8 show the results of the ablation analysis, respectively. K refers to the number of regions and takes the values 2, 4, 6, and 8. The number of partitions affects the performance of the network model. When K is 2, the results are clearly poor on both Rank-1 and mAP. When K is 4 or 8, the results are also good. However, considering both performance and cost, we finally choose a compromise and pre-define six regions on the entire image.
The channel attention mechanism allows the model to focus on the foreground area of the target pedestrian. To examine whether the channel attention mechanism (CAM) benefits our method, we conducted ablation experiments on the Market1501 dataset. Table 9 shows the results of the ablation analysis, with the best results in bold. It can be seen that CAM plays a positive role on this dataset: when R = 1.0, the mAP of the FSA method with CAM improves by 0.7%.
Table 9. Ablation studies on the channel attention mechanism (Market1501). 'w/o CAM' means that the CAM module has been removed and 'w/ CAM' means that the CAM module has been added.

Result on CUHK03 dataset (Detected):
The experimental results are shown in Table 10, where RE (Zhong et al. 2017) means random erasing data augmentation.

Conclusion
The goal of partial person re-identification is to accurately match the holistic image of a pedestrian with a partial image of the same pedestrian, which raises two major problems: spatial misalignment and interference from non-shared areas. To solve these two problems, we focus on the areas shared by the holistic image and the partial image. Matching between the shared areas does not require alignment and also avoids the interference caused by non-shared areas. Our method mainly includes two parts: region pre-definition and foreground-aware region feature extraction. First, the image is segmented into a fixed number of regions by uniform segmentation, pseudo labels are assigned to each region by self-supervised learning, and the pseudo labels are used as supervisory signals to guide regional feature learning. Second, we use a channel attention mechanism to let the model focus on the foreground area of the target pedestrian, predict the probability that each pixel belongs to each pre-defined region, and obtain the probability maps and visibility scores. The feature of each region is obtained by weighting the pedestrian foreground features with the probability maps. Experiments demonstrate the effectiveness of the proposed method on two holistic pedestrian datasets, Market1501 and DukeMTMC-reID, and two partial pedestrian datasets, Partial-REID and Partial-iLIDS.