Object Re-Identification Based on Deep Learning

With the explosive growth of video data and the rapid development of computer vision technology, more and more relevant technologies are applied in our real life, one of which is object re-identification (Re-ID) technology. Object Re-ID is cur-rently concentrated in the field of person Re-ID and vehicle Re-ID, which is mainly used to realize the cross-vision tracking of person/vehicle and trajectory prediction. This chapter combines theory and practice to explain why the deep network can re-identify the object. To introduce the main technical route of object Re-ID, the examples of person/vehicle Re-ID are given, and the improvement points of existing object Re-ID research are described separately.


Introduction
In a surveillance camera without overlapping vision, a recognized object is identified again after imaging conditions (including monitoring scene, lighting conditions, object pose, etc.) change, which is called object re-identification (Object Re-ID). Object Re-ID technology has important research significance in intelligent monitoring, multi-object tracking and other fields. In recent years, scholars have paid extensive attention to it. The main application areas of object Re-ID are person Re-ID and vehicle Re-ID.
Person re-identification (Re-ID) is a technology that uses computer vision technology to judge whether there is a specific person in the image or video sequence. It is widely regarded as a sub-problem of image retrieval. Given a monitor person image, retrieve the image of the row of people across the device. It aims to make up for the visual limitations of the current fixed cameras, and can be combined with person detection and pedestrian tracking technology, which can be widely used in intelligent video monitoring, intelligent security and other fields.
Vehicle re-identification (Re-ID) aims to quickly search, locate and track the target vehicles across surveillance camera networks, which plays key roles in maintaining social public security and serves as a core module in the large-scale vehicle recognition, intelligent transportation, surveillance video analytic platforms. Vehicle Re-ID refers to the problem of identifying the same vehicle in a large scale vehicle database given a probe vehicle image. In particular, vehicle Re-ID can be regarded as a fine-grained recognition task that aims at recognizing the subordinate category of a given class. The wide popularization and use of road video monitoring makes vehicle matching based on video image become the hot spot in current intelligent traffic research, and the typical applications are vehicle origindestination analysis and vehicle trajectory reconstruction. In some cases which license plate number could be recognized clearly and accurately, vehicle Re-ID could be realized by match the license plate number. However, in more cases, such as license plate can't be recognized (for most surveillance video), license plate occlusion and so on in the criminal investigation, it is necessary to realize the vehicle Re-ID without license plates by using computer vision and other related technologies.

Related work of object Re-ID
As an emerging research topic, object Re-ID has attracted great efforts. Existing research directions of object Re-ID are mainly divided into person Re-ID and vehicle Re-ID. In this section, we will review the relevant works from person Re-ID and vehicle Re-ID.

Person Re-ID
We will review the relevant work [1] of person Re-ID from following aspects: person Re-ID based on representation learning, metric learning, local features and video sequence.

Person Re-ID based on representation learning
Methods based on representation learning are a kind of very common person Re-ID methods, which is mainly thanks to the deep learning, especially the Convolutional neural network (CNN) development. Sunderrajan et al. [2] propose a clothing context-aware color extraction method to learn color drift patterns in a non-parametric manner using the random forest distance (RFD) function. Geng et al. [3] proposed a person Re-ID algorithm which used Classification loss and verification loss to train the network (including Classification Subnet and Verification Subnet), and the network inputs several pairs of pedestrian images. The classification subnetwork makes ID prediction on the image, and calculates the classification error loss according to the predicted ID. The sub-network integrates the features of two images and judge whether these two images belong to the same pedestrian. The sub-network is essentially equivalent to a binary classification network. After enough data training, input a test image again, and the network will automatically extract a feature, which is used for person Re-ID. For the problem that pedestrian ID information alone is not enough to learn a model with strong generalization ability, the researchers added attributes such as gender, hair and clothing to the pedestrian images. By introducing the pedestrian attribute label, the model should not only accurately predict the pedestrian ID, but also predict the correct pedestrian attributes, which greatly increases the generalization ability of the model. Most papers also show that this method is effective. Lin et al. [4] proposed a person Re-ID algorithm based on multiple attributes. In this algorithm, the features of network output are not only used to predict the ID information of pedestrians, but also to predict the attributes of each pedestrian. The combination of ID loss and attribute loss can improve the generalization ability of the network. Currently, there is still a lot of work based on representational learning. Representational learning has also become a very important baseline of Re-ID field. Moreover, the method of representational learning is more robust, the training is more stable, and the results are easier to reproduce. However, representation learning is easy to be overfitted in the domain of the data set, and when the training ID is increased to a certain extent, it will be weak.

Person Re-ID based on metric learning
Metric learning is a method widely used in the field of image retrieval. Unlike representational learning, metric learning aims to learn the similarity between two images through the network. In the problem of person Re-ID, the similarity of different images of the same pedestrian is greater than that of different images of the different pedestrians. Finally, the loss function of the network makes the distance between the same pedestrian images (positive sample pairs) as small as possible, and the distance between different pedestrian images (negative sample pairs) as large as possible. Common measures of learning loss include Contrastive loss, Triplet loss, Quadruplet loss, Triplet hard loss with batch hard mining (TriHard loss) and Margin sample mining loss (MSML). Varior et al. [5] proposed Siamese Network, and trained the network model by contrast loss. By reducing the contrast loss, the distance between positive sample pairs is gradually reduced, and the distance between negative sample pairs is gradually increased, so as to meet the need of person Re-ID. Triplet loss is a widely used metric learning loss and a lot of metric learning methods have evolved based on triples. Ding et al. [6] considered the re-identification problem as a ranking issue and used triplet loss to obtain the relative distance between images. Chen et al. [7] designed a quadruplet loss process, which can lead to model outputs with larger inter-class variation and smaller intraclass variation compared with the triplet loss method. Hermans et al. [8] proposed a batch training based online difficult sample sampling method, which is named TriHard Loss. Traditional triplet sample mining strategy randomly select three images from training data, and most of the sampled images are simple and easily distinguishable sample pairs, which is not conducive to better representation of network learning. This paper proposes a sample mining strategy that can obtain more difficult samples which can improve the generalization ability of the network. Xiao et al. [9] proposed Margin sample mining loss which introduces the idea of hard sample sampling. MSML losses are calculated by picking only the hardest positive sample pair and the hardest negative sample pair. It is a measure learning method that takes into account both relative distance and absolute distance and introduces the idea of difficult sample sampling.

Person Re-ID based on local features
In the early stage of ReID's research, people still focused on global feature, but later the global feature encountered a bottleneck, so they began to study local feature gradually. The commonly used methods to extract local features include image segmentation, positioning of skeleton key points and posture correction, etc. Image segmentation is a very common way to extract local features. Wei et al. [10] develop a pedestrian image descriptor named Global-Local-Alignment Descriptor, this descriptor explicitly leverages the local and global cues in human body to generate a discriminative and robust representation. In order to solve the failure of manual image slice in the case of image misalignment, some papers first align pedestrians with some prior knowledge, which mainly includes pre-trained human Pose and Skeleton key points model. Su et al. [11] proposed a pose-driven deep convolutional model to alleviate the pose variations and learn robust feature representations from both the global images and different local parts. Liang et al. [12] first estimated the key points of pedestrians with the model of attitude estimation, and then made the same key points align with affine transformation. To extract local features at different scales, they set three different PoseBox combinations; afterwards, the three PoseBox corrected images were sent to the network together with the original corrected images to extract features, which contained both global and local information. In order to solve the problem of local feature alignment, most methods need an additional skeleton key point or pose estimation model. Zhang et al. [13] proposed an automatic alignment model based on SP distance (AlignedReID), which automatically aligned local features without requiring additional information.

Person Re-ID based on video sequence
The main difference between video sequence-based methods is that such methods not only consider the content information of the image, but also consider the motion information between frames. Liu et al. [14] propose an algorithm called Accumulative motion context network (AMOC), the input of AMOC includes the original image sequence and the extracted optical flow sequence. AMOC has Spatial network and Motion network. Each frame of an image sequence is input into Spat Nets to extract the global content features of the image, the two adjacent frames will be sent to the Moti Nets to extract the optical flow pattern features; then the spatial features and optical flow features are merged and input into an RNN to extract the temporal features. Through the AMOC network, each image sequence can be extracted with a feature that integrates content information and motion information. The network adopts classification loss and comparison loss to train the model. Sequential image features with motion information can improve the accuracy of person Re-ID. Mazzeo et al. [15] propose a multi camera architecture for wide area surveillance and a real time people tracking algorithm across non overlapping cameras, they proposed different methodologies [16] to extract the color histogram information from each object patches for the intra-camera and compared different methods to evaluate the colour Brightness Transfer Function (BTF) between non overlapping cameras for inter-camera tracking. This method outperforms the performance in terms of matching rate between different cameras.

Vehicle Re-ID
We will review the relevant works of vehicle Re-ID from three aspects: vehicle re-identification based on artificial design feature, vehicle re-identification based on deep learning feature and vehicle re-identification based on fusion feature.

Vehicle Re-ID based on artificial design feature
In the initial vehicle matching problem, sensor tag matching is adopted. Tian et al. [17] proposed an algorithm for vehicle Re-ID based on multiple sensor nodes. According to the matching results of the same vehicle label obtained by different nodes, the vehicle state was determined and the label segmentation was modified. Meanwhile, the time difference between vehicles was modified according to the relationship between different acquired labels. Coifman [18] proposed a matching algorithm for individual vehicles measured on the highway detector and made corresponding measurements on another detector upstream. Rios-Cabrera et al. [19] proposed a comprehensive scheme for solving the problems of vehicle detection, recognition and tracking in view of the practical application of tunnel monitoring, and proposed compact binary features to improve the recognition effect for the influence of poor imaging conditions and vehicle lights in tunnel monitoring.
Due to the late rise of vehicle Re-ID research, when traditional methods have not been applied to this problem too much, the deep learning technology has developed in a big bang. Almost all subsequent studies are based on deep learning technology, which greatly improves the effect of Re-ID.

Vehicle Re-ID based on deep learning feature
In recent years, convolutional neural network has been widely used in the field of computer vision and achieved remarkable effects. Because the depth features extracted by deep convolutional networks have stronger description ability, more and more scholars have applied them in vehicle Re-ID. Liu et al. [20] proposed a large-scale vehicle Re-ID data set "VeRi," and puts forward a method of feature Fusion FACT by combining the depth of the vehicle network features, color features and SIFT features to match the same vehicle, the follow-up of vehicle recognition of other study, a large number of experiment based on the data set, thereby evaluating effectiveness and superiority of the proposed algorithm. Liu et al. [21] solved the problem of difficulty in triplet loss convergence by adding a feature representation between the sample and each individual vehicle into the triplet network to model intra-class variance. Li et al. [22] proposed DJDL (Deep Joint Discriminative Learning) model, which projects the original vehicle image into Euclidean space through a two-branch Deep convolution network. Zhang et al. [23] proposed a guided Triplet network, which added classification loss to the original triplet loss function and strongly restricted the original training network, thus improving the Re-ID efficiency. Marin et al. [24] designed a metric learning model based on the supervision of the local constraints, its use in pairs and triple constraints to train a network, the network is able to share the same identity of the sample distribution of high similarity, and keep a distance of different identity in the feature space, the algorithm is one of the biggest advantage is to use the vehicle tracking to automatically generate a set of weak tag data, and will automatically generate data sets used in depth training network to complete the vehicles Re-ID task.

Vehicle Re-ID based on fusion feature
For monitoring video, in addition to appearance information of images, information other than appearance features (such as, space-time information) is also of great mining significance. Liu et al. [25] proposed a segmented vehicle Re-ID algorithm, which first used appearance features for preliminary screening, then used license plate information for matching, and finally used spatial and temporal information for reordering. After the method was integrated with spatial and temporal information, the effect was improved to a certain extent. Jiang et al. [26] proposed a vehicle Re-ID algorithm based on multiple attribute training and sort by spatialtemporal similarity, the vehicle image color, models, vehicle feature extraction with individual respectively. Through the fusion of multiple features for the initial Re-ID, the Re-ID results are reordered by the spatial-temporal similarity, and good results are obtained. Shen et al. [27] proposed a two-stage architecture containing complex spatiotemporal information, given a pair of vehicle images with spatiotemporal information, candidate visual spatio-temporal paths (where each visual spatio-temporal state corresponds to an actual vehicle image with spatio-temporal information) are generated by an MRF chain model with a deep learning function, and then candidate paths and paired queries are used to generate their similarity scores for the model. In addition to fusion of information other than the apparent features of images, many scholars have also studied fusion of manual features and deep convolution features, fusion of various attribute features or feature fusion between different image regions. Li et al. [28] proposed a vehicle Re-ID algorithm based on fusion features extract from different part of vehicle, firstly, a part detection algorithm [29] is used to obtain the attention area with big difference between different vehicles. Then, feature extraction was carried out on the detected area, and features of the two areas were fused to generate new fusion features. Liang et al. [30] put forward a new method of supervision and the depth of the hash to handle large-scale vehicle search problem, the use of multitasking learning to learn, vehicle model, vehicle image color depth features of individual ID hash code, the experimental results show the effectiveness of the proposed method, the method in classification loss and triple loss case depth hash method is superior to single task.

Some public database for object Re-ID
With the development of Re-ID research, many scholars have published the data sets of relevant fields. The following are some commonly used person Re-ID data sets and vehicle Re-ID data sets.

CUHK03
The dataset includes 13,164 images of 1360 pedestrians. The whole dataset is captured with six surveillance cameras. Each identity is observed by two disjoint camera views and has an average of 4-8 images in each view. Some examples are shown in Figure 1. Besides the scale, it has the following characteristics.
This dataset is partitioned into training set (1160 persons), validation set (100 persons), and test set (100 persons). Each person has roughly 4-8 photos per view, which means there are almost 26,000 positive training pairs before data augmentation.

Market1501
During dataset collection, a total of six cameras were placed in front of a campus supermarket, including five 1280 Â 1080 HD cameras, and one 720 Â 576 SD camera. Overlapping exists among these cameras. This dataset contains 32,668 boxes of 1501 identities. Due to the open environment, images of each identity are captured by at most six cameras. Each annotated identity is captured by at least two cameras, so that cross-camera search can be performed. Overall, the dataset has the following featured properties.
The dataset is randomly divided into training and testing sets, containing 750 and 751 identities, respectively. During testing, for each identity, it selects one query image in each camera. Note that, the selected queries are hand-drawn, instead of DPM-detected as in the gallery. The reason is that in reality, it is very convenient to interactively draw a box, which can yield higher recognition accuracy. The search process is performed in a cross-camera mode, i.e., relevant images captured in the same camera as the query are viewed as "junk." In this scenario, an identity has at most six queries, and there are 3368 query images in total. Dataset examples are shown in Figure 2.

VRID-1
The open dataset VRID-1 for vehicle re-identification contains 10,000 images, which are captured by 326 surveillance cameras within 14 days. The resolutions of  images are distributed from 400 Â 424 to 990 Â 1134. VRID collects 1000 vehicle IDs (vehicle identities) of top 10 common vehicle models ( Table 1) to reconstruct the interference with the same vehicle model in the real world. The vehicle IDs belong to the same model have very similar appearance and their differences appears in the area of the logo and accessories. Besides, each vehicle IDs contains 10 images which are in various illuminations, poses and weather condition. Dataset examples are shown in Figure 3.
The attributes of VRID is illustrated in Table 2. The vehicle model column represents the vehicle model information. The license plate column is used for the correlation of the same vehicle. The window location column shows the location of vehicle window area. The vehicle color column contains the vehicle color information. Besides, with the rich attributes of vehicles, the dataset could also be used for vehicle fine-grained recognition as well as vehicle color recognition.

VeRi-776
To collect high-quality videos in real-world surveillance scene, we select 20 cameras deployed along a circular road of a 1.0 km 2 area as shown in Figure 4.    Table 2.
The attributes of VRID.  The scenes of the cameras include two-lane roads, four-lane roads, and crossroads. All cameras are set to 1920 Â 1080 resolution and 25 fps. The cameras are deployed with arbitrary orientations and tilt angles. Besides, there are overlaps for part of the cameras. The VeRi dataset is collected with 20 cameras in real-world traffic surveillance environment. A total of 776 vehicles are annotated. Two hundred vehicles are used for testing. The remaining 576 vehicles are for testing. There are 11,579 images in the test set, 1678 images as queries and 37,778 images in the training set. Each vehicle is captured by at least two cameras. One advantage of this data set is that the camera ID and timestamp (frame ID) are reserved with tracks for further annotation. Dataset examples are shown in Figure 5.

General technical route
In deep learning method, the general technical route of object Re-ID includes three stages: data input stage, feature extraction model and distance measurement ( Figure 6).

Data input
Data input mainly refers to feeding data to feature extraction model, and the commonly used data type in object Re-ID is three-channel image. In this part, we do not describe the input data, but mainly introduce data augmentation. In the training stage of deep learning model, insufficient data often leads to the situation that the model cannot converge or overfit. In order to avoid this situation, data augmentation is one of the solutions. Common operations for data augmentation are as follows: • Color Jittering. Color data enhancement, such as Change image brightness, saturation, contrast and so on.
• Random Scale. Randomly change the original size of the image.
• Horizontal/vertical flip. Flip the original image horizontally or vertically.
In the data input stage, we need to pay attention to not only the data amplification, but also, in some special cases such as contrast loss or triplet loss model, we may need to construct the image pair or triplet sample in advance. Due to the limitation of GPU memory, it is impossible to input a batch of data includes all images, so it is possible that there is no negative sample which might result in the failure of image pair or triplet sample construction, at the same time, due to the large number of target individuals in the re-identification problem, the imbalance between positive and negative samples is very likely to exist, which easily leads to the unscientific network model trained. Therefore, we need to set some rules in the data input stage to correctly construct these image pairs or triplet samples.

Feature extraction model
The core of object Re-ID algorithm is feature extraction model, the effectiveness of the whole algorithm is also almost determined by this part. In other words, the essence of Re-ID is to compare the similarity or distance between the features extracted from two images. Image features mainly include color feature, texture feature, shape feature and spatial relationship feature. Feature extraction is a concept in computer vision and image processing. It refers to the use of computer to extract image information to determine whether each image pixel belongs to an image feature. Features are the best way to describe patterns, and we often think that each dimension of a feature can describe a pattern from a different perspective.

Histogram of oriented gradient (HOG)
The essence of HOG feature extraction is to constitute features by computing and statistics the histogram of gradient direction in the local area of the image. Hog feature combined with SVM classifier has been widely used in image recognition, especially in pedestrian detection, which has achieved great success. How to extract HOG feature? Firstly, the image is divided into small connected regions, which are called cell units. Then the direction histogram of the gradient or edge of each pixel in the cell is collected. Finally, these histograms can be combined to form a feature descriptor.

Convolution neural network (CNN)
It is a kind of feedforward neural network with deep structure including convolution calculation. A convolutional neural network contains three types of neural network layers: convolutional layer, pooling layer and fully connection layer. As is shown in Figure 7.

Convolutional layer
The convolution layer is mainly used for learning the feature representation of input data. The convolution layer is composed of multiple convolution kernels, and the convolution operation is carried out on the input image to calculate different feature maps.
In general, the input data is RGB image, as shown in Figure 8. If the color image is 6*6*3, the three refers to three color channels, and the convolution operation is carried out with a 3*3*3 convolution kernel, corresponding to the red, green and blue channels. Take the 27 numbers in turn, multiply them by the Numbers in the corresponding red, green and blue channel, and then add them all up to get the first number in the output of the feature graph.
The convolution layer principle is shown in equation: where f * ð Þ is activation functions; x l j denotes the j À th feature map of output layer l, x l i is the i À th feature map of the layer l; K l ij represents the convolution kernel of the i À th feature graph of the current input layer and the j À th feature graph of the output layer on the layer l; b l j is the bias term of the j À th feature graph in the layer l.

Pooling layer
Pooling layer is often used in the convolutional network to reduce the size of the model, improve the computational speed, and improve the robustness of extracted features. Pooling operation can maintain the invariance of translation, rotation and scale. Common pooling layer operations are averaging and pooling. The maximum pooling operation is as shown in Figure 9. The input of 4*4 is divided into different regions. For the output of 2*2, each element output is the maximum element value in its corresponding color region.

Fully connection layer
Each node of the fully connection layer is connected to all nodes of the previous layer to integrate the features extracted from the previous layer. Due to its fully connected nature, the general fully connected layer also has the most parameters. The full join layer act as a mapping of the learned "distributed feature representation" into the sample tag space. It's essentially a linear transformation from one eigenspace to another eigenspace. Any dimension of the target space is affected by every dimension of the source space. In CNN, the full connection is often found in the last few layers, which is used to make a weighted sum of the features designed before. The schematic diagram of the entire connection layer is shown in Figure 10.

Distance measurement
After feature extraction, we need to compare the distance between the query image and all images in the retrieval set, and there are many ways you can measure the difference between two features, It is divided into distance measure (such as, Euclidean distance, Manhattan Distance etc.) and similarity measure (such as, Cosine Similarity, Jaccard Coefficient, etc.).
Distance measure is used to measure the distance of an individual in space, the greater the distance, the greater the difference between individuals. Similarity measurement is to calculate the degree of similarity between individuals. Contrary to distance measurement, the smaller the value of similarity measurement is, the smaller the similarity between individuals is, and the greater the difference is. Therefore, we can judge which images are more likely to be the same individual by the value of the difference between image features.

One case of person Re-ID
Person Re-ID is a technology that uses computer vision technology to judge whether there is a specific person in the image or video sequence. It is widely regarded as a sub-problem of image retrieval. Given a monitor person image, retrieve the image of the row of people across the device. It aims to make up for the visual limitations of the current fixed cameras, and can be combined with person detection and pedestrian tracking technology, which can be widely used in intelligent video monitoring, intelligent security and other fields. In this section, we show a classic person Re-ID algorithm Part-based Convolutional Baseline (PCB) [43].

Structure of PCB
PCB can take any network without hidden fully-connected layers designed for image classification as the backbone, e.g., Google Inception and ResNet. Original paper employs ResNet50 as the backbone network to reproduce the PCB algorithm.
The structure of PCB illustrated in Figure 11. The input image goes forward through the stacked convolutional layers from the backbone network to form a 3D tensor T. PCB replaces the original global pooling layer with a conventional pooling layer, to spatially down-sample T into p pieces of column vectors g. A following 1 Â 1 kernel-sized convolutional layer reduces the dimension of g. Finally, each dimension-reduced column vector h is input into a classifier, respectively. Each classifier is implemented with a fully-connected (FC) layer and a sequential Softmax layer. During training, each classifier predicts the identity of the input image and is supervised by Cross-Entropy loss. During testing, either p pieces of g or h are concatenated to form the final descriptor of the input image.

Dataset
The original paper tested this algorithm on person Re-ID dataset Market-1501. The Market-1501 dataset contains 1501 identities observed under six camera viewpoints, 19,732 gallery images and 12,936 training images detected by DPM.

Performance comparison
It compares PCB and PCB + RPP with state of the art. Comparisons on Market-1501 are detailed in Table 3. PCB + RPP get mAP = 81.6% and Rank-1 = 93.8% for Market-1501, setting new state of the art on this dataset. All the results are achieved under the single-query mode without re-ranking. Reranking methods will further Figure 11. Structure of PCB [43].
boost the performance especially mAP. For example, when "PCB + RPP" is combined with the reranking method, mAP and Rank-1 accuracy on Market-1501 increases to 91.9 and 95.1%, respectively.

One case of vehicle Re-ID
Considering the spatiotemporal logic of vehicle driving process, we present a vehicle re-identification (Re-ID) algorithm based on multi-camera data's spatiotemporal information and joint learning mechanism without license plate. The algorithm is divided into feature extraction and spatiotemporal re-rank. In the feature extraction stage: on the basis of convolutional neural network (CNN), triplet loss and Softmax loss were used for joint training to model a feature extractor and calculate the feature distance measurement matrix between query image and retrieval set images. In the spatiotemporal re-rank stage: we calculate the spatiotemporal distance matrix and fuse the spatiotemporal distance with the normalized feature distance metric. The final distance measurement matrix is sorted to obtain the vehicle re-identification result. Extensive experiments were carried out on the benchmark datasets "VeRi" to verify the effectiveness of the proposed method and the result have shown that the proposed algorithm outperforms the state-of-the-art approaches for vehicle Re-ID.

Mathematical principles of joint learning
The architecture of proposed algorithm is illustrated in Figure 12. The algorithm is divided into two steps: feature extraction and spatiotemporal re-rank. In the feature extraction phase, triplet loss and Softmax loss are integrated for joint training, triplet loss is used to calculate the distance of the sample features, increasing the distance between the anchor and negative sample, reducing the distance between the anchor and the positive sample. Softmax loss performs label-level  Table 3.
Comparison of the proposed method with the art on Market-1501. supervision and constraint on the feature extraction network. In the spatiotemporal re-rank stage, calculating the spatiotemporal distance between images, and re-rank the retrieval results by merging the spatiotemporal distance and the feature distance.

Triplet loss
In order to learn high discriminative features from images to Euclidean space, where the distance can measure the discrepancy between two images. The idea of learning to rank has gradually been applied to many fields, such as face recognition [12], person Re-ID, and so on. One of the important steps in learning to rank is to find a good similarity function, and triplet loss is a very broad one. In the calculation of the triple loss, the feed data includes an anchor, a positive sample, and a negative sample, and the sample similarity calculation is realized by optimizing the distance between the anchor and the positive sample being smaller than the distance between the anchor and the negative sample. We suppose T ¼  2).
where f x i ð Þ denotes the embedded of the image, α denotes the parameter of expected gap between the distance of x a i ; x p i È É and x a i ; x n i È É

Triplet sampling
This algorithm directly performs on-line triplet mining on image features, which is to compute useful triplets on the fly. For each batch of inputs, given a batch of N examples, we compute the N embeddings and we then can find a maximum of N 3 triplets. For three indices a, p, n ∈ [1, N], if examples a and p have the same label but are distinct, and example n has a different label, we say that (a, p, n) is a valid triplet. We suppose that have a batch of vehicle images as input of size N = PK, composed of P different vehicle ID with K images each. Choose the batch hard strategy: for each anchor, select the hardest positive and the hardest negative among the batch, finally we can obtain PK triplets.

Softmax loss
We impose a strong constraint on distinguishing different vehicle label by adding Softmax loss to the loss function. The embedded obtained by CNN tend to clusters, and the embedded of same vehicle ID will be similar, so the convergence time of triplet loss will be cut down. In Softmax loss stream, each vehicle ID in the training set is considered as a category, the Softmax loss function is formulated as: where m is the total amount of classes, k is the number of training image, 1 * ð Þ is the indicator function (if * is true, then the value set 1, or 0), and θ 0 s are the parameters of the final full-connection layer of the CNN.

Joint multiple loss
The joint learning mechanism is mainly applied to the training phase of vehicle images. After the shared images pass through the shared CNN layer, they are divided into two branch streams, one is subjected to online triplet mining for the calculation of triplet loss, and the other stream enters the Softmax layer for Softmax loss calculation. The final joint learning loss function can be formulated as: 6.2 Experimental results

Dataset
In order to verify the validity of the algorithm, we conduct experiments in the latest version of the vehicle re-identification dataset "VeRi." The dataset has a total of 49,357 images, which are taken for actual road monitoring and contains various angles and various vehicle models, as shown in Figure 13.
It is divided into two subsets for training and testing. The train set has 576 vehicle IDs with 37,778 images and the test set has 200 vehicle IDs with 11,579 images. For the vehicle re-identification task, we divided the test set to query set (1678 images) and retrieval set (9901 images).

Experimental setting
All of the experiments are based on the deep learning framework Tensorflow. The base network is VGG_CNN_M, and the model was pre-trained on the ImageNet. In the calculation of triplet loss, we set α ¼ 1, the learning rate is set to 0.001, and the mini-batch is set to 32. In order to evaluate the effect of this algorithm objectively, we set up two algorithms to compare with the method proposed to verify that the improvement of the algorithm. These algorithms are: (1) VGG + Softmax loss; (2) VGG + Triplet loss; (3) VGG + Softmax loss + Triplet loss (our method). All of network based on VGG16, "Softmax loss" denotes use Softmax loss to train the network, and "Triplet loss" denotes use triplet loss to train the network. At the same time, we also make comparison our experiment results with some state-of-the-art algorithms on the same dataset "VeRi."

Performance comparison on VeRi dataset
We conduct the experiment as described in experimental setting, and use cumulative matching curve (CMC), HIT@1, HIT@5 as metrics to evaluate the performance. In our method, S, T denote using Softmax loss and using triplet loss respectively. Table 4 and Figure 14 illustrate the performances of the proposed methods and some state-of-the-art algorithms in vehicle Re-ID field.
The results show that the proposed method "VGG + S + T" achieves the best results, the HIT@1 and HIT@5 hit 89.75 and 95.05% respectively. It is obvious that the CNN-based method has a significant improvement over the handcraft featurebased approach when compare BOW-CN and LOMO algorithm with other algorithms based on CNN feature. Compared with "VGG + S" which only utilizes Softmax loss, our method has much better results, improving 16.81% in HIT@1 and 11.38% in HIT@5. Compared with "VGG + T" which only utilizes triplet loss, our method makes improvement about 16.81% in HIT@1 and 8.67% in HIT@5. Compare to "FACT + Plate-SNN + STR" which additionally utilizes license plate information (Plate-SNN) and spatiotemporal relation (STR), our method improves 28.31% in HIT@1 and 16.27% in HIT@5. In summary, the proposed algorithm is feasible in vehicle re-identification task, and achieves outstanding results compared to other algorithms.

Summary
This chapter mainly introduces the concept of object Re-ID and two core applications: person Re-ID and vehicle Re-ID. In this chapter, the definitions of person Re-ID and vehicle Re-ID are given, some research methods of the two applications are reviewed, and the commonly used public data sets are described in detail.
In this chapter, the general process of object Re-ID by deep learning method is given, and the data input, feature extraction network structure, distance measurement and other parts are described in detail. At the same time, two examples are given to illustrate the algorithm in detail and experiment comparison. Person Re-ID refers to the network structure and experimental results of PCB algorithm [43]. Vehicle Re-ID is introduced in detail in terms of feature extraction and measurement calculation. The influence of parameters in the deep learning method is illustrated through the analysis of experimental results, and the evaluation comparison is given.
These can help relevant researchers to understand the context of technology, the general implementation process, as well as important parameters and evaluation indicators in this field, so that they can quickly start relevant research.
The object Re-ID is the basis of realizing cross-camera tracking. Person and vehicles are just two typical applications. In the future, with the gradual solution of the following problems, we will have a more extensive application: • High-quality standard database is important to generalization performance of Re-ID algorithm. The database should be more suitable for the real environment and including different and varying scenes.
• Deep networks have poor interpretability. Although the deep learning method has achieved good performance in Re-ID tasks, few studies have shown which information has a greater impact on Re-ID behind the continuous improvement in accuracy. • At present, most methods are carried out under the prior condition that object has been detected, but this requires a very robust detection model. We need to combine object Re-ID with object detection, which is more in line with practical application requirements.
• The research should focus on semi-supervised, unsupervised and transfer learning methods. The collected data are limited after all, and the cost of labeling data is also very high. Therefore, although the semi-supervised and unsupervised learning methods may not be as good as the supervised learning methods in terms of performance, they are valuable.