Visible–infrared person re-identification based on key-point feature extraction and optimization☆
Introduction
Person re-identification (Re-ID) [1], [2], [3] aims to find designated persons across disjoint cameras, and has been widely applied in urban management, public security, the construction of smart cities, and other fields. Many recent studies on Re-ID are limited to visible person images captured by single-modality cameras, and rely mainly on person appearance features that are distinctive under visible-light conditions. At night or in dim conditions, however, a camera cannot capture visible person images clearly. To keep working under insufficient light, most current surveillance cameras automatically switch from the visible-light (RGB) modality to the near-infrared (IR) modality and capture infrared person images. RGB images are captured under visible light and therefore contain three channels of color information, whereas infrared images captured in the near-infrared modality contain only a single channel of invisible-spectrum information. Since color information cannot be exploited in this case, matching persons across the visible and infrared modalities is difficult. Some studies focus on the Euclidean distance between images of the two modalities and realize feature alignment by directly minimizing that distance, similar to techniques used in traditional target detection [4], [5], [6]. However, owing to the large visual gap between the two modalities, such methods rarely achieve ideal performance. Inspired by traditional single-modality person re-identification, some methods [7], [8] use neural networks to learn global features shared by the two modalities, but intra-modality differences may still lead to image mismatches. Part-based methods [9], [10], [11] seek effective local features in images with large modality gaps; however, when two images differ greatly, the local features usually contain more invalid information and can easily cause mismatches at the matching stage.
Inspired by generative adversarial networks (GANs), some methods [12], [13] use GANs to generate images and translate images from one modality into the other. However, because Re-ID is a demanding recognition task in which the identity labels of the training and test sets may not overlap, a GAN trained on the training identities may fail to generate satisfactory images at test time. Other methods [14], [15] use local and global features simultaneously to find fine-grained and coarse-grained discriminative features. How to exploit features of different scales to learn modality-invariant discriminative features remains a problem worth studying. In addition, the color information differs greatly between images of the two modalities, and interference factors such as occlusion and shadow in the environment make it difficult for a network model to extract effective features. All of these problems hinder the extraction of discriminative features from images of different modalities. We therefore propose a multi-hop attention graph convolution network (MAGC) and a self-attention semantic perception layer (SSPL) based on high-order person information to address them. Several works [13], [16], [17] have shown that key-point features can be strongly resistant to noise interference. Our main idea is therefore to take person key-point features as the basic unit and construct a person graph feature matrix through a predefined adjacency matrix, so that the relationships between person key-points are preserved. Graph convolution networks [18] have a natural advantage in handling topological relationship features and, when supplemented by a multi-hop attention structure, offer good resistance to noise in VI-ReID. The goal of SSPL is to autonomously locate the more discriminative regions in the person features learned by MAGC while suppressing the adverse effect of irrelevant regions on discrimination.
In summary, we design a network that uses two convolutional neural networks to extract person key-point features, and transforms the initially scattered key-point features into person key-point graph features using a predefined adjacency matrix that preserves the connections between pairs of key-points. The graph features are then passed into MAGC and SSPL to learn high-order person information. In this way, useful identification information can be extracted from the limited information shared by the two modalities. At the same time, key-point-based feature learning effectively alleviates the problem of local features becoming unavailable due to noise or occlusion.
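The graph construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 14-node skeleton and its edge list are assumptions, and the symmetric normalization D^-1/2 (A+I) D^-1/2 is the standard GCN choice rather than one the paper specifies.

```python
import numpy as np

# Hypothetical 14-key-point skeleton; this edge list is an assumption,
# not the paper's exact predefined topology.
NUM_KEYPOINTS = 14
EDGES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (1, 6),
         (6, 7), (7, 8), (6, 9), (9, 10), (6, 11), (11, 12), (12, 13)]

def build_adjacency(num_nodes, edges):
    """Symmetric adjacency with self-loops, normalized as D^-1/2 (A+I) D^-1/2."""
    A = np.eye(num_nodes)                      # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0                # undirected body connections
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

A_hat = build_adjacency(NUM_KEYPOINTS, EDGES)

# Key-point features X (K x C) extracted by the backbone form one graph sample;
# a single graph-convolution layer then propagates information along body
# connections: H = ReLU(A_hat @ X @ W).
rng = np.random.default_rng(0)
X = rng.standard_normal((NUM_KEYPOINTS, 256))  # one feature vector per key-point
W = rng.standard_normal((256, 128))            # learnable projection (random here)
H = np.maximum(A_hat @ X @ W, 0.0)
print(H.shape)
```

Because the adjacency is fixed by the body structure, every person image yields a graph with the same topology, so batches of key-point features can share one `A_hat`.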
Our main contributions are as follows:
(1) We propose a new cross-modality image matching method. The method takes person key-point features as its basic unit and organizes them into graph features according to the actual structure of the human body. Mining person structure information within and across modalities facilitates the learning of VI-ReID features.
(2) We design a multi-hop attention graph convolution network to learn modality-invariant graph features, distributing weights across the network's layers to strengthen resistance to noise interference.
(3) We use a self-attention semantic perception layer to focus on the more informative regions during discrimination while suppressing invalid regions.
(4) We conduct experiments to verify the effectiveness of our proposed method on two VI-ReID datasets and a holistic Re-ID dataset.
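The multi-hop attention idea in contribution (2) can be sketched as one layer that aggregates 1-hop through K-hop neighborhoods of the normalized adjacency and fuses them with attention weights. This is a sketch under stated assumptions: the paper does not give this exact formulation, and `hop_scores` and `W`, which would be learned parameters in the real network, are plain inputs here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_hop_attention_gcn(X, A_hat, W, hop_scores):
    """One multi-hop attention graph-convolution layer (sketch).
    The k-th hop propagates features through A_hat^k; the hops are fused
    with attention weights so that noisy hops can be down-weighted."""
    alphas = softmax(hop_scores)               # attention over hops, sums to 1
    out = np.zeros((X.shape[0], W.shape[1]))
    A_k = np.eye(A_hat.shape[0])
    for alpha in alphas:
        A_k = A_k @ A_hat                      # A_hat^1, A_hat^2, ...
        out += alpha * (A_k @ X @ W)
    return np.maximum(out, 0.0)                # ReLU

# Toy example: 3 key-points in a chain, fused over 3 hops.
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])                   # adjacency with self-loops
d = A.sum(1)
A_hat = A / np.sqrt(np.outer(d, d))            # D^-1/2 A D^-1/2
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))
W = rng.standard_normal((8, 4))
H = multi_hop_attention_gcn(X, A_hat, W, hop_scores=np.array([0.5, 0.3, 0.2]))
print(H.shape)
```

Mixing hops this way lets distant but related key-points (e.g. a hand and the torso) contribute to each node's representation without stacking many layers.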
Section snippets
Single-modality Person Re-ID
The purpose of person Re-ID is to find persons with the same identity in images taken by different visible-light cameras. Deep learning methods are currently widely studied [19], [20], [21]. Sun et al. [19] constructed a new baseline network with a partial pooling layer for refinement, so as to redistribute extreme local feature information. He et al. [20] proposed a reconstruction algorithm that learns person features in combination with a CNN and replaced the original pixel-level feature …
Proposed approach
We propose a two-module graph convolution model based on high-order person information. One module is a semantic information extraction module for feature extraction and integration; the other is a semantic information relationship-building module for feature optimization learning. The model connects key-points according to the actual positions of the human-body key-points and integrates them into a graph. The feature information in graph form is then passed into the …
Experimental settings
We verify our method on two VI-ReID datasets (RegDB [38] and SYSU-MM01 [27]) and one RGB Re-ID dataset (Market-1501 [39]), and we use mean average precision (mAP) and Cumulative Matching Characteristic (CMC) curves as our evaluation criteria to evaluate the model performance.
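The mAP and CMC criteria mentioned above can be computed from a query-by-gallery distance matrix as follows. This is a minimal sketch: it assumes every query has at least one gallery match and omits protocol details such as the same-camera gallery filtering used in the official SYSU-MM01 evaluation.

```python
import numpy as np

def evaluate_rank(dist, q_ids, g_ids):
    """CMC curve and mAP from a (num_query x num_gallery) distance matrix.
    Sketch only: assumes each query identity appears in the gallery."""
    num_q, num_g = dist.shape
    cmc = np.zeros(num_g)
    aps = []
    for i in range(num_q):
        order = np.argsort(dist[i])                     # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(float)
        cmc[np.argmax(matches):] += 1                   # from first correct rank on
        hits = np.cumsum(matches)
        precision = hits / np.arange(1, num_g + 1)      # precision at each rank
        aps.append((precision * matches).sum() / matches.sum())
    return cmc / num_q, float(np.mean(aps))

# Two queries, each closest to its correct gallery sample -> perfect scores.
dist = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9]])
q_ids = np.array([0, 1])
g_ids = np.array([0, 1, 2])
cmc, mAP = evaluate_rank(dist, q_ids, g_ids)
print(cmc[0], mAP)   # 1.0 1.0
```

CMC at rank k is the fraction of queries whose first correct match appears within the top-k gallery results; mAP averages the per-query average precision over the whole ranking.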
RegDB [38] is a VI-ReID dataset captured by a visible camera and a far-infrared camera from the front, back, and sides of 412 different persons. The dataset contains 8240 images in total, with each person having 10 RGB …
Conclusions
In this paper, we propose a multi-hop attention graph convolution network (MAGC) and a self-attention semantic perception layer (SSPL) to learn more discriminative features from images. By adding a multi-hop attention mechanism to a conventional graph convolution network, we alleviate the fitting problems that a deep stack of layers causes in conventional networks. At the same time, a self-attention semantic perception layer is added after the convolution network to make the model …
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The work is partially supported by the National Natural Science Foundation of China (Nos. U1836216, 62176144, 62076153), the Major Fundamental Research Project of Shandong, China (No. ZR2019ZD03), and the Taishan Scholar Project of Shandong Province, China (No. ts20190924).
References (47)
- et al., Person re-identification based on multi-scale feature learning, Knowl.-Based Syst. (2021)
- et al., Cross-modality paired-images generation and augmentation for RGB-infrared person re-identification, Neural Netw. (2020)
- et al., A three-stage learning approach to cross-domain person re-identification, Appl. Soft Comput. (2021)
- et al., Person re-identification by enhanced local maximal occurrence representation and generalized similarity metric learning, Neurocomputing (2018)
- et al., Deep learning for person re-identification: A survey and outlook (2020)
- et al., A survey of open-world person re-identification, IEEE Trans. Circuits Syst. Video Technol. (2020)
- et al., Unsupervised person re-identification by deep asymmetric metric embedding, IEEE Trans. Pattern Anal. Mach. Intell. (2020)
- et al., Person reidentification via ranking aggregation of similarity pulling and dissimilarity pushing, IEEE Trans. Multim. (2016)
- Mang Ye, Xiangyuan Lan, Jiawei Li, Pong C. Yuen, Hierarchical Discriminative Learning for Visible Thermal Person…
- et al., A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets, IEEE Trans. Pattern Anal. Mach. Intell. (2019)
- Person re-identification by saliency learning, IEEE Trans. Pattern Anal. Mach. Intell.
- Group re-identification with multi-grained matching and integration
- DCR: A unified framework for holistic/partial person ReID, IEEE Trans. Multim.
☆ This paper has been recommended for acceptance by Zicheng Liu.