
1 Introduction

Metric learning methods aim at learning a parametrized distance function from a labeled set of samples, so that under the learned distance, samples with the same labels are nearby and samples with different labels are far apart [1]. Many fundamental questions in computer vision, such as “How to compare two images, and for what information?”, boil down to this problem. Among them, person re-identification is the problem of recognizing individuals at different physical locations and times, on images captured by different devices.

It is a challenging problem which has recently received a lot of attention because of its importance in various application domains such as video surveillance, biometrics, and behavior analysis [2].

The performance of person re-identification systems relies mainly on the image feature representation and the distance measure used to compare them. Hence the research in the field has focused either on designing features [3, 4] or on learning a distance function from a labeled set of images [5–9].

It is difficult to analytically design features that are invariant to the various non-linear transformations that an image undergoes such as illumination, viewpoint, pose changes, and occlusion. Furthermore, even if such features were provided, the standard Euclidean metric would not be adequate as it does not take into account dependencies on the feature representation. This motivates the use of metric learning for person re-identification.

Re-identification models are commonly evaluated by the cumulative match characteristic (CMC) curve [6]. This measure indicates how the matching performance of the algorithm improves as the number of returned images increases. Given a matching algorithm and a labeled test set, each image is compared against all the others, and the position of the first correct match is recorded. The CMC curve indicates for each rank the fraction of test samples which had that rank or better. A perfect CMC curve would reach the value 1 at rank \(\#1\), that is, the best match is always of the correct identity.
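
As an illustration, the following sketch (our own Python/NumPy code, with hypothetical function names) computes a CMC curve from a matrix of pairwise distances, assuming every identity appears at least twice in the test set:

```python
import numpy as np

def cmc_curve(dist, labels):
    """Compute the CMC curve from an (N, N) pairwise distance matrix and
    the (N,) array of identity labels.  Entry r of the result is the
    fraction of probe images whose first correct match has rank <= r + 1."""
    n = len(labels)
    first_correct_rank = []
    for i in range(n):
        d = dist[i].copy()
        d[i] = np.inf                         # never match an image with itself
        order = np.argsort(d)                 # gallery sorted by increasing distance
        correct = labels[order] == labels[i]
        first_correct_rank.append(np.argmax(correct))  # 0-based rank of first hit
    first_correct_rank = np.array(first_correct_rank)
    return np.array([(first_correct_rank <= r).mean() for r in range(n - 1)])
```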

In this paper we are interested in learning a Mahalanobis distance by minimizing a weighted rank loss such that the precision at the top rank positions of the CMC curve is maximized. When learning the metric, we directly learn the low-rank projection matrix instead of the PSD matrix, for computational efficiency and scalability to high-dimensional datasets (see Sect. 3.1). However, naively learning the low-rank projection matrix suffers from matrix rank degeneration and non-isolated minima [10]. We address this problem with a simple regularizer which approximately enforces the orthonormality of the learned matrix very efficiently (see Sect. 3.2). We extend the WARP loss [10–12] and combine it with our approximate orthonormal regularizer to derive a metric learning algorithm which approximately minimizes a weighted rank loss efficiently using stochastic gradient descent (see Sect. 3.3).

We extend our model to kernel space to handle distance measures which are more natural for the features we are dealing with (see Sect. 3.4). We also show that in kernel space SGD can be carried out more efficiently by using preconditioning [5, 13].

We validate our approach on nine person re-identification datasets: Market-1501 [14], CUHK03 [15], OpeReid [16], CUHK01 [17], VIPeR [18], CAVIAR [3], 3DPeS [19], iLIDS [20] and PRID450s [21], where we outperform other metric learning methods proposed in the literature, both in speed and accuracy.

2 Related Works

Metric learning is a well studied research problem [22]. Most of the existing approaches have been developed in the context of the Mahalanobis distance learning paradigm [1, 5, 6, 23, 24]. This consists in learning distances of the form:

$$\begin{aligned} \mathcal {D}_{M}^2(x_i, x_j) = (x_i-x_j)^TM(x_i-x_j), \end{aligned}$$
(1)

where M is a positive semi-definite matrix. Based on the way the problem is formulated the algorithms for learning such distances involve either optimization in the space of positive semi-definite (PSD) matrices, or learning the projection matrix W, in which case \(M = W^TW\).
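
The following minimal sketch (our own NumPy code, with arbitrary dimensions) illustrates this equivalence: the Mahalanobis distance with \(M = W^TW\) equals the squared Euclidean distance after projection by W.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Dp = 8, 3                        # input and projected dimensions (arbitrary)
W = rng.standard_normal((Dp, D))    # low-rank projection matrix
M = W.T @ W                         # induced positive semi-definite matrix

x_i, x_j = rng.standard_normal(D), rng.standard_normal(D)
d = x_i - x_j

dist_M = d @ M @ d                  # (x_i - x_j)^T M (x_i - x_j)
dist_W = np.sum((W @ d) ** 2)       # ||W (x_i - x_j)||^2

assert np.allclose(dist_M, dist_W)  # the two formulations coincide
```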

Large margin nearest neighbors [1] (LMNN) is a metric learning algorithm designed to maximize the performance of k-nearest neighbor classification in a large margin framework. Information theoretic metric learning [24] (ITML) exploits the relationship between the Mahalanobis distance and Gaussian distributions to learn the metric. Many researchers have applied LMNN and ITML to the re-identification problem with varying degrees of success [21].

Pairwise Constrained Component Analysis (PCCA) [5] is a metric learning method that learns the low rank projection matrix W in the kernel space from sparse pairwise constraints. Xiong et al. [8] extended PCCA with a \(L_2\) regularization term and showed that it further improves the performance.

Köstinger et al. [6] proposed the KISS (“Keep It Simple and Straightforward”) metric learning method, abbreviated as KISSME. Their method enjoys very fast training, and they show good empirical performance and scaling properties with respect to the number of samples. However, this method suffers from the Gaussian assumptions of the model.

Li et al. [7] consider learning a local thresholding rule for metric learning. This method is computationally expensive to train, even with as few as 100 dimensions.

The performance of many kernel-based metric learning methods for person re-identification was evaluated in [8]. In particular the authors evaluated PCCA [5], variants of kernel Fisher discriminant analysis (KFDA) and reported that the KFDA variants consistently out-perform all other methods. The KFDA variants they investigated were Local Fisher Discriminant Analysis (LFDA) and Marginal Fisher Discriminant Analysis (MFA).

Chen et al. [25] attempt to learn a metric in the polynomial feature map, exploiting the relationship between the Mahalanobis metric and the polynomial features. Ahmed et al. [26] propose a deep learning model which learns the features as well as the metric jointly. Liao et al. [4] propose XQDA, exploiting the benefits of Fisher discriminant analysis and KISSME to learn a metric. However, like FDA and KISSME, XQDA’s modeling power is limited because of the Gaussian assumptions on the data. In another work, Liao et al. [9] apply accelerated proximal gradient descent (APGD) to learn a Mahalanobis metric under a logistic loss similar to the loss of PCCA [5]. The use of APGD makes this model converge fast compared to existing batch metric learning algorithms, but it still suffers from scalability issues because all the pairs are required to compute one gradient step and the projection step onto the PSD cone is computationally expensive.

None of the above mentioned techniques explicitly models the objective that we are looking for in person re-identification, that is, to optimize a weighted rank measure. We show that modeling this in the metric learning objective improves the performance. We address scalability through stochastic gradient descent (SGD), and our model naturally eliminates the need for asymmetric sample weighting as we use a triplet-based loss function.

There is an extensive body of work on optimizing ranking measures such as AUC, precision at k, \(F_{1}\) score, etc. Most of this work focuses on learning a linear decision boundary in the original input space, or in the feature space, for ranking a list of items based on the chosen performance measure. A well-known model of this kind is the structural SVM [27]. In contrast, here we are interested in ranking pairs of items by learning a metric. A related work by McFee et al. [28] studies metric learning with different rank measures in the structural SVM framework. Wu et al. [29] used this framework for person re-identification by optimizing the mean reciprocal rank criterion. Outside the direct scope of metric learning from a single feature representation, Paisitkriangkrai et al. [30] developed an ensemble algorithm to combine different base metrics in the structural SVM framework, which leads to excellent performance for re-identification. Such an approach is complementary to ours, as combining heterogeneous feature representations requires a separate additional level of normalization or a combination with a voting scheme.

Table 1. Notation

We use the WARP loss from WSABIE [12], proposed for the large-scale image annotation problem, which is a multi-label classification problem. WSABIE learns a low dimensional joint embedding for both images and annotations by optimizing the WARP loss. This work reports excellent empirical results in terms of accuracy, computational efficiency, and memory footprint.

The work most closely related to ours is FRML [10], which learns a Mahalanobis metric by optimizing the WARP loss function with SGD. However, there are some key differences with our approach. FRML is a linear method using an \(L_2\) or LMNN regularizer, and relies on an expensive projection step in the SGD. Besides, this projection requires keeping a record of all the gradients in the mini-batch, which results in a high memory footprint. The rationale for the projection step is to accelerate the SGD, because directly optimizing a low-rank matrix may result in a rank-deficient matrix, and thus in non-isolated minimizers which may generalize poorly to unseen samples. We propose a computationally cheap solution to this problem by using a regularizer which approximately enforces the orthonormality of the learned matrix efficiently.

3 Weighted Approximate Rank Component Analysis

This section presents our metric learning algorithm, Weighted Approximate Rank Component Analysis (WARCA). Table 1 summarizes some important notations that we use in the paper.

Let us consider a training set of data point/label pairs:

$$\begin{aligned} (x_n, y_n) \in \mathbb {R}^D \times \{ 1, \dots , Q \}, \ n = 1, \dots ,N. \end{aligned}$$
(2)

and let \(\mathcal {S}\) be the set of pairs of indices of samples of same labels:

$$\begin{aligned} \mathcal {S} = \left\{ (i, j) \in \{1, \dots , N \}^{2}, \ y_i = y_j \right\} . \end{aligned}$$
(3)

For each label y we define the set \(\mathcal {T}_{{y}}\) of indices of samples of a class different from y:

$$\begin{aligned} \mathcal {T}_{{y}} = \left\{ k \in \{1, \dots , N \}, \ y_k \ne {y} \right\} . \end{aligned}$$
(4)

In particular, to each \((i, j) \in \mathcal {S}\) corresponds a set \(\mathcal {T}_{y_{i}}\).

Let W be a linear transformation that maps the data points from \(\mathbb {R}^D\) to \(\mathbb {R}^{D'}\), with \(D' \le D\). For the ease of notation, we do not distinguish between matrices and their corresponding linear mappings. The distance function under the linear map W is given by:

$$\begin{aligned} \mathcal {F}_{W}(x_i, x_j) = \Vert W (x_i - x_j ) \Vert _2. \end{aligned}$$
(5)

3.1 Problem Formulation

For a pair of points (i, j) of the same label \(y_i = y_j\), we define a ranking error function:

$$\begin{aligned} \forall (i, j) \in \mathcal {S}, \ err(\mathcal {F}_{W}, i, j) = L\left( rank_{i,j}\left( \mathcal {F}_{W} \right) \right) \end{aligned}$$
(6)

where:

$$\begin{aligned} rank_{i,j}\left( \mathcal {F}_{W} \right) = \sum _{k \in \mathcal {T}_{{y_{i}}}} \mathbbm {1}_{\mathcal {F}_{W}(x_i, x_k) \le \mathcal {F}_{W}(x_i, x_j)}. \end{aligned}$$
(7)

is the number of samples \(x_k\) of different labels which are closer to \(x_i\) than \(x_j\) is.
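
A brute-force computation of this rank, in NumPy and with names of our own choosing, looks as follows:

```python
import numpy as np

def rank_ij(W, X, y, i, j):
    """Rank of Eq. 7: number of samples x_k with a label different from y_i
    that lie at least as close to x_i as x_j does, under the distance F_W.
    X is the (N, D) data matrix, y the (N,) label vector, W is (Dp, D)."""
    d_ij = np.linalg.norm(W @ (X[i] - X[j]))
    d_ik = np.linalg.norm((X[i] - X) @ W.T, axis=1)   # distance of x_i to every sample
    impostors = y != y[i]                             # samples from other classes
    return int(np.sum(impostors & (d_ik <= d_ij)))
```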

Formulating our objective that way, following closely the formalism of [12], shows how training a multi-class predictor shares similarities with our metric-learning problem. The former aims at avoiding, for any given sample, to have incorrect classes with responses higher than the correct one, while the latter aims at avoiding, for any pair of samples \((x_i, x_j)\) of the same label, to have samples \(x_k\) of other classes in between them.

Minimizing the rank directly treats all rank positions equally, whereas in many problems, including person re-identification, we are interested in having the correct match within the top few rank positions. This can be achieved with a weighting function \({{L}}(\cdot )\) which penalizes a drop in rank at the top positions more than at the bottom positions. In particular, we use the rank weighting function proposed by Usunier et al. [11], of the form:

$$\begin{aligned} {{L}}(r) = \sum _{s=1}^{r} \alpha _s, \ \alpha _1 \ge \alpha _2 \ge ... \ge 0. \end{aligned}$$
(8)

For example, using \(\alpha _1=\alpha _2=\,...\,=\alpha _m\) treats all rank positions equally, while using larger values of \(\alpha _s\) for the first few ranks weights the top rank positions more. We use the harmonic weighting, which has such a profile and was also used in [12], where it yielded state-of-the-art results.
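
As a concrete sketch (our own code; we take the harmonic coefficients \(\alpha_s = 1/s\), the standard choice in WSABIE), the weighting of Eq. 8 can be computed as follows:

```python
import numpy as np

def rank_weight(r, alpha=None):
    """L(r) = sum_{s=1}^{r} alpha_s of Eq. 8.  By default the harmonic
    coefficients alpha_s = 1/s are used, which penalize mistakes at the
    top ranks the most."""
    if r == 0:
        return 0.0
    if alpha is None:
        alpha = 1.0 / np.arange(1, r + 1)   # 1, 1/2, 1/3, ...
    return float(np.sum(alpha[:r]))

# rank_weight(1) = 1.0 while rank_weight(3) ~ 1.83: dropping the correct match
# from rank 1 to rank 3 costs much more than dropping it from rank 10 to 12.
```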

Finally, we would like to solve the following optimization problem:

$$\begin{aligned} \mathop {\text {argmin}}\limits _{W} \ \frac{1}{|\mathcal {S}|} \sum _{(i, j) \in \mathcal {S}} {{L}}\left( rank_{i,j}\left( \mathcal {F}_{W} \right) \right) . \end{aligned}$$
(9)

3.2 Approximate OrthoNormal (AON) Regularizer

The optimization problem of Eq. 9 may lead to severe over-fitting on small and medium scale datasets. Regularizing penalty terms are central in re-identification for that reason.

The standard way of regularizing a low-rank metric learning objective function is by using a \(L_2\) penalty, such as the Frobenius norm [10]. However, such a regularizer tends to push toward rank-deficient linear mappings, which we observe in practice (see Sect. 4.4, and in particular Fig. 2a).

Lim et al. [10], in their FRML algorithm, address this problem by using a Riemannian manifold update step in their SGD algorithm, which is computationally expensive and induces a high memory footprint. We propose an alternative approach that maintains the rank of the matrix by pushing toward orthonormal matrices. This is achieved by using as a penalty term the \(L_2\) divergence of \(W W^T\) from the identity matrix \(\mathbf {I}\):

$$\begin{aligned} \Vert W W^T - \mathbf {I}\Vert ^2. \end{aligned}$$
(10)

This orthonormal regularizer can also be seen as a strategy to mimic the behavior of approaches such as PCA or FDA, which ensure that the learned linear transformation is orthonormal. For such methods, this property emerges from the strong Gaussian prior over the data, which is beneficial on small data-sets but degrades performance on large ones where it leads to under-fitting. Controlling the orthonormality of the learned mapping through a regularizer weighted by a meta-parameter \(\lambda \) allows us to adapt it on each data-set individually through cross-validation.
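
A small sketch of the penalty and its gradient (our own code; the gradient of \(\Vert W W^T - \mathbf {I}\Vert ^2\) with respect to W is \(4(WW^T - \mathbf {I})W\), so with the \(\lambda /2\) prefactor used below the contribution to the full gradient is \(2\lambda (WW^T - \mathbf {I})W\)):

```python
import numpy as np

def aon_penalty(W):
    """Approximate OrthoNormal penalty ||W W^T - I||^2 of Eq. 10 (W is Dp x D)."""
    R = W @ W.T - np.eye(W.shape[0])
    return np.sum(R ** 2)

def aon_gradient(W):
    """Gradient of the penalty with respect to W: 4 (W W^T - I) W."""
    return 4.0 * (W @ W.T - np.eye(W.shape[0])) @ W

# For an orthonormal mapping the penalty and its gradient both vanish:
Q, _ = np.linalg.qr(np.random.randn(10, 4))   # Q has orthonormal columns
assert np.isclose(aon_penalty(Q.T), 0.0)
assert np.allclose(aon_gradient(Q.T), 0.0)
```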

Finally, with this regularizer the optimization problem of Eq. 9 becomes:

$$\begin{aligned} \mathop {\text {argmin}}\limits _{W} \ \frac{\lambda }{2} \Vert W W^T - \mathbf {I} \Vert ^2 + \frac{1}{|\mathcal {S}|} \sum _{(i, j) \in \mathcal {S}} {{L}}\left( rank_{i,j}\left( \mathcal {F}_{W} \right) \right) . \end{aligned}$$
(11)

3.3 Max-Margin Reformulation

The metric learning problem in Eq. 11 aims at minimizing the 0-1 loss, which is a difficult optimization problem. Applying the reasoning behind the WARP loss to make it tractable, we upper-bound this loss with the hinge one with margin \(\gamma \). This is equivalent to minimizing the following loss function:

$$\begin{aligned} \begin{aligned} \mathcal {L}(W) = \, \,&\frac{\lambda }{2}\Vert WW^T - \mathbf {I}\Vert ^2 \ + \frac{1}{|\mathcal {S}|} \sum _{(i, j) \in \mathcal {S}} \sum _{k \in \mathcal {T}_{{y_{i}}}} {{L}}({rank}^\gamma _{i,j}(\mathcal{F}_W))\frac{\left| \gamma + \xi _{{i}{j}{k}} \right| _+}{{rank}^\gamma _{i,j}(\mathcal{F}_W)}, \end{aligned} \end{aligned}$$
(12)

where:

$$\begin{aligned} \xi _{{i}{j}{k}} = \mathcal {F}_{W}(x_{i}, x_{j}) - \mathcal {F}_{W}(x_{i}, x_{k}) \end{aligned}$$
(13)

and \({rank}^\gamma _{i,j}(\mathcal{F}_W)\) is the margin penalized rank:

$$\begin{aligned} {rank}^\gamma _{i,j}(\mathcal{F}_W) = \sum _{k \in \mathcal {T}_{{y_{i}}}} \mathbbm {1}_{\gamma \!+\! \xi _{{i}{j}{k}} > 0}. \end{aligned}$$
(14)

The loss function in Eq. 12 is the WARP loss [10–12]. Weston et al. [12] showed that the WARP loss can be efficiently minimized using stochastic gradient descent, and we follow the same approach:

  1. Sample (i, j) uniformly at random from \(\mathcal {S}\).

  2. For the selected (i, j), sample k uniformly in \(\left\{ k \in \mathcal {T}_{{y_{i}}}: \gamma + \xi _{{i}{j}{k}} > 0 \right\} \), i.e. from the set of incorrect matches scored higher than the correct match \(x_j\).

The sampled triplet (i, j, k) has a contribution of \({{L}}({rank}^\gamma _{i,j}(\mathcal{F}_W))|\gamma + \xi _{{i}{j}{k}} |_+\) because the probability of drawing a k in step 2 from the violating set is \(\frac{1}{{rank}^\gamma _{i,j}(\mathcal{F}_W)}\), which compensates, in expectation, for the denominator in Eq. 12.
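
A sketch of this sampling step, with the violating set computed by brute force (our own code and naming; in practice the distances to the negatives can be restricted to the current mini-batch):

```python
import numpy as np

def sample_violating_triplet(W, X, y, i, j, gamma=1.0, rng=np.random.default_rng()):
    """For a positive pair (i, j), return a violating negative k, the
    margin-penalized rank of Eq. 14, and the hinge term |gamma + xi_ijk|_+."""
    d_ij = np.linalg.norm(W @ (X[i] - X[j]))
    d_ik = np.linalg.norm((X[i] - X) @ W.T, axis=1)
    xi = d_ij - d_ik                                   # xi_ijk of Eq. 13, for every k
    violators = np.where((y != y[i]) & (gamma + xi > 0))[0]
    if len(violators) == 0:                            # no margin violation, no update
        return None
    rank_gamma = len(violators)                        # Eq. 14
    k = rng.choice(violators)                          # step 2: uniform over violators
    return k, rank_gamma, gamma + xi[k]
```

The contribution of the returned triplet to the stochastic gradient is then weighted by \({{L}}({rank}^\gamma _{i,j}(\mathcal{F}_W))\), as explained above.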

We use the above sampling procedure to solve WARCA efficiently using mini-batch stochastic gradient descent (SGD). We use the Adam SGD algorithm [31], which we found empirically to converge faster than vanilla SGD.

3.4 Kernelization

Most commonly used features in person re-identification are histogram-based, such as LBP, SIFT bag-of-words, and RGB histograms, to name a few. The most natural distance measure for histogram-based features is the \(\chi ^{2}\) distance. Most standard metric learning methods work on the Euclidean distance, with PCCA being a notable exception. To plug in an arbitrary metric suitable for the features, such as \(\chi ^2\), one has to resort to explicit feature maps that approximate the \(\chi ^{2}\) metric. However, this blows up the dimension and the computational cost. Another way to deal with this problem is to do metric learning in the kernel space, which is the approach we follow.
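
For concreteness, here is a small sketch (our own code) of one common form of the \(\chi^2\) kernel for non-negative, \(L_1\)-normalized histogram features; whether the additive form below or the exponential \(\chi^2\)-RBF variant is preferable is a design choice we do not settle here:

```python
import numpy as np

def chi2_kernel(X, Y=None, eps=1e-12):
    """Additive chi-square kernel K(x, y) = sum_d 2 x_d y_d / (x_d + y_d)
    for non-negative features.  X is (N, D), Y is (M, D); returns (N, M)."""
    if Y is None:
        Y = X
    K = np.zeros((X.shape[0], Y.shape[0]))
    for m in range(Y.shape[0]):
        num = 2.0 * X * Y[m]          # element-wise, broadcast over rows of X
        den = X + Y[m] + eps          # eps avoids division by zero for empty bins
        K[:, m] = np.sum(num / den, axis=1)
    return K
```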

Let W be spanned by the samples:

$$\begin{aligned} W = A X^T = A \left( \begin{array}{c} x_1^T \\ \dots \\ x_N^T \end{array} \right) . \end{aligned}$$
(15)

which leads to:

$$\begin{aligned} \mathcal {F}_{A}(x_i, x_j)&= \Vert A X^T(x_i - x_j) \Vert _2, \end{aligned}$$
(16)
$$\begin{aligned}&= \Vert A ({\kappa }_i - {\kappa }_j) \Vert _2. \end{aligned}$$
(17)

where \({\kappa }_i\) is the \(i^{th}\) column of the kernel matrix \(K = X^TX\). The loss function in Eq. 12 then becomes:

$$\begin{aligned} \mathcal {L}(A)= \frac{\lambda }{2}\Vert AKA^T-\mathbf {I}\Vert ^2 + \frac{1}{|\mathcal {S}|}\sum _{(i, j) \in \mathcal {S}} \sum _{k \in \mathcal {T}_{{y_{i}}}} {{L}}({rank}^\gamma _{i,j}(\mathcal{F}_A)) \frac{|\gamma +\xi _{{i}{j}{k}} |_+}{{rank}^\gamma _{i,j}(\mathcal{F}_A)}, \end{aligned}$$
(18)

with:

$$\begin{aligned} \xi _{{i}{j}{k}} = \mathcal {F}_{A}(x_{i}, x_{j}) - \mathcal {F}_{A}(x_{i}, x_{k}). \end{aligned}$$
(19)

Apart from allowing non-linear metric learning, kernelized WARCA can again be solved efficiently using stochastic sub-gradient descent. If we use the inverse of the kernel matrix as the pre-conditioner of the stochastic sub-gradient, the computation of the update equation, as well as the parameter update, can be carried out efficiently. Mignon et al. [5] used the same technique to solve their PCCA, and showed that it converges faster than vanilla gradient descent. We use the same technique to derive an efficient update rule for our kernelized WARCA. A stochastic sub-gradient of Eq. 18 with the sampling procedure described in the previous section is given by:

$$\begin{aligned} \nabla \mathcal {L}(A) = 2\lambda (AKA^T-\mathbf {I})AK + 2{{L}}({rank}^\gamma _{i,j}(\mathcal{F}_A))A\mathbbm {1}_{{\gamma + \xi _{{i}{j}{k}} > 0}} \mathcal {G}_{ijk} , \end{aligned}$$
(20)

where:

$$\begin{aligned} \mathcal {G}_{ijk} = \frac{({\kappa }_{i}-{\kappa }_{j})({\kappa }_{i}-{\kappa }_{j})^T}{d_{ij}} - \frac{({\kappa }_{i}-{\kappa }_{k})({\kappa }_{i}-{\kappa }_{k})^T}{d_{ik}}, \end{aligned}$$
(21)

and:

$$\begin{aligned} d_{ij} = \mathcal {F}_{A}(x_{i}, x_{j}), \ \ d_{ik} = \mathcal {F}_{A}(x_{i}, x_{k}). \end{aligned}$$
(22)

Multiplying the right hand side of Eq. 20 by \(K^{-1}\):

$$\begin{aligned} \nabla \mathcal {L}(A)K^{-1} = 2\lambda (AKA^T-\mathbf {I})A + 2{{L}}({rank}^\gamma _{i,j}(\mathcal{F}_A))AK\mathbbm {1}_{{\gamma + \xi _{{i}{j}{k}} > 0}}\mathcal {E}_{ijk} . \end{aligned}$$
(23)

with:

$$\begin{aligned} \mathcal {E}_{ijk} = K^{-1}\mathcal {G}_{ijk} K^{-1} = \frac{(e_{i}\!-\!e_{j})(e_{i}\!-\!e_{j})^T}{d_{ij}} - \frac{(e_{i}\!-\!e_{k})(e_{i}\!-\!e_{k})^T}{d_{ik}} . \end{aligned}$$
(24)

where \(e_l\) is the \(l^{th}\) vector of the canonical basis, that is, the vector whose \(l^{th}\) component is one and all others are zero. In the preconditioned stochastic sub-gradient descent we use updates of the form:

$$\begin{aligned} A_{t+1} = (\mathbf {I} - 2\lambda \eta (A_{t}KA_{t}^T-\mathbf {I}))A_{t} - 2\eta {{L}}({rank}^\gamma _{i,j}(\mathcal{F}_A))A_{t}K\mathbbm {1}_{\gamma + \xi _{{i}{j}{k}} > 0}\mathcal {E}_{ijk}. \end{aligned}$$
(25)

Please note that \(\mathcal {E}_{ijk}\) is a very sparse matrix with only nine non-zero entries, which makes the update extremely fast. Preconditioning also enjoys faster convergence rates since it exploits second order information through the preconditioning operator, here the inverse of the kernel matrix [13].
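
The sketch below (our own code and naming) spells out one such preconditioned step of Eq. 25 for a sampled triplet (i, j, k); the loss term only touches columns i, j and k of A, while the regularizer term affects the full matrix:

```python
import numpy as np

def warca_kernel_update(A, K, i, j, k, rank_weight_ijk, lam, eta, gamma=1.0):
    """One preconditioned SGD step of Eq. 25.  A is (Dp, N), K is the (N, N)
    kernel matrix, rank_weight_ijk = L(rank^gamma_ij) for the sampled pair."""
    u = A @ (K[:, i] - K[:, j])            # A (kappa_i - kappa_j)
    v = A @ (K[:, i] - K[:, k])            # A (kappa_i - kappa_k)
    d_ij, d_ik = np.linalg.norm(u), np.linalg.norm(v)

    grad_loss = np.zeros_like(A)
    if gamma + d_ij - d_ik > 0:            # indicator of Eq. 25
        # A K E_ijk written with rank-one terms: only columns i, j, k change
        grad_loss[:, i] = u / d_ij - v / d_ik
        grad_loss[:, j] = -u / d_ij
        grad_loss[:, k] = v / d_ik
    reg = (A @ K @ A.T - np.eye(A.shape[0])) @ A      # regularizer term of Eq. 25
    return A - 2.0 * eta * (lam * reg + rank_weight_ijk * grad_loss)
```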

4 Experiments

We evaluate our proposed algorithm on nine standard person re-identification datasets. We first describe the datasets and baseline algorithms and then present our results. Our code will be made publicly available.

4.1 Datasets and Baselines

The largest dataset we experimented with is the Market-1501 dataset [14], which is composed of 32,668 images of 1,501 persons captured from 6 different view points. It uses DPM [32] detected bounding boxes as annotations. CUHK03 dataset [15] consists of 13,164 images of 1,360 persons and it has both DPM detected and manually annotated bounding boxes. We use the manually annotated bounding boxes here. OpeReid dataset [16] consists of 7,413 images of 200 persons. CUHK01 dataset [17] is composed of 3,884 images of 971 persons, with two pairs of images per person, each pair taken from a different viewpoint. VIPeR [18] dataset has 1,264 images of 632 persons, with 2 images per person. The PRID450s dataset [21] consists of 450 image pairs recorded from two different static surveillance cameras. The CAVIAR dataset [3] consists of 1,220 images of 72 individuals from 2 cameras in a shopping mall. The 3DPeS dataset [19] has 1,011 images of 192 individuals, with 2 to 6 images per person. The dataset is captured from 8 outdoor cameras with horizontal but significantly different viewpoints. Finally the iLIDS dataset [20] contains 476 images and 119 persons, with 2 to 8 images per individual.

We compare our method against the current state-of-the-art baselines MLAPG, rPCCA, SVMML, FRML, LFDA and KISSME. A brief overview of these methods is given in Sect. 2. rPCCA, MLAPG, SVMML and FRML are iterative methods, whereas LFDA and KISSME are spectral methods based on the second order statistics of the data. Since WARCA, rPCCA and LFDA are kernel methods, we used both the \(\chi ^2\) kernel and the linear kernel with them to benchmark the performance. Marginal Fisher discriminant analysis (MFA) has been shown to give results similar to LFDA, so we do not use it as a baseline.

We did not compare against other ranking-based metric learning methods such as LORETA [33], OASIS [34] and MLR [28] because all of them are linear methods. In fact we derived a kernelized OASIS, but the results were not as good as ours or rPCCA. We also do not compare against LMNN and ITML because many researchers have evaluated them before [5–7] and found that they do not perform as well as the other methods considered here.

Table 2. Rank 1, rank 5 and AUC performance measures of our method WARCA against other state-of-the-art methods. Bold fields indicate the best performing methods. The dashes indicate computations that could not be run in a realistic setting on Market-1501

4.2 Technical Details

For the Market-1501 dataset we used the experimental protocol and features described in [14], together with their baseline code. As Market-1501 is quite large for kernel methods, we do not evaluate them on it. We also do not evaluate linear methods such as linear rPCCA and SVMML, because their optimization algorithms were found to be very slow.

All other evaluations were carried out in the single-shot experiment setting [2], and our experimental settings are very similar to the one adopted by Xiong et al. [8]. Except for Market-1501, we randomly divided all the other datasets into two subsets such that there are p individuals in the test set. We created 10 such random splits. In each partition, one image of each person was randomly selected as a probe image, the rest of the images were used as gallery images, and this was repeated 10 times. The position of the correct match was processed to generate the CMC curve. We followed the standard train-validation-test splits for all the other datasets, and p was chosen to be 100, 119, 486, 316, 225, 36, 95 and 60 for CUHK03, OpeReid, CUHK01, VIPeR, PRID450s, CAVIAR, 3DPeS and iLIDS respectively.

We used the same set of features for all the datasets except Market-1501, and all the features are essentially histogram-based. All images were first re-scaled to \(128\times 48\) resolution, then 16-bin color histograms on the RGB, YUV, and HSV channels, as well as texture histograms based on Local Binary Patterns (LBP), were extracted on 6 non-overlapping horizontal patches. All the histograms are normalized per patch to have unit \(L_1\) norm and concatenated into a single vector of dimension 2,580 [5, 8].
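
A rough sketch of this feature pipeline (our own code, using scikit-image; the exact binning and LBP configuration that yield the 2,580-dimensional vector are not reproduced here and should be treated as assumptions):

```python
import numpy as np
from skimage.color import rgb2hsv, rgb2yuv
from skimage.feature import local_binary_pattern
from skimage.transform import resize

def reid_features(img, n_stripes=6, n_bins=16):
    """Per-stripe RGB/YUV/HSV color histograms plus an LBP texture histogram,
    each stripe L1-normalized, concatenated into a single feature vector."""
    img = resize(img, (128, 48), anti_aliasing=True)         # float image in [0, 1]
    yuv = rgb2yuv(img)
    yuv[..., 1:] += 0.5                                       # shift U, V towards [0, 1]
    channels = np.clip(np.dstack([img, yuv, rgb2hsv(img)]), 0.0, 1.0)
    lbp = local_binary_pattern(img.mean(axis=2), P=8, R=1, method="uniform")
    feats = []
    for rows in np.array_split(np.arange(128), n_stripes):   # horizontal stripes
        hists = [np.histogram(channels[rows, :, c], bins=n_bins, range=(0.0, 1.0))[0]
                 for c in range(channels.shape[2])]           # 9 color histograms
        hists.append(np.histogram(lbp[rows, :], bins=10, range=(0, 10))[0])
        v = np.concatenate(hists).astype(float)
        feats.append(v / max(v.sum(), 1.0))                   # unit L1 norm per stripe
    return np.concatenate(feats)
```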

Fig. 1. CMC curves comparing WARCA against state-of-the-art methods on nine re-identification datasets

The source code for LFDA, KISSME and SVMML is available from the respective authors’ websites, and we used it to reproduce the baseline results [8]. The code for PCCA is not released publicly. A version from Xiong et al. [8] is available publicly, but the memory footprint of that implementation is very high, making it impossible to use with large datasets (e.g. it requires 17 GB of RAM to run on the CAVIAR dataset). Therefore, to reproduce the results in [8], we wrote our own implementation, which uses 30 times less memory and can scale to much larger datasets. We also ran sanity checks to make sure that it behaves the same as the baseline code. All the implementations were done in Matlab, with mex functions for the acceleration of the critical components.

In order to fairly evaluate the algorithms, we set the dimensionality of the projected space to be the same for WARCA, rPCCA and LFDA. For the Market-1501 dataset the dimensionality used is 200, for VIPeR it is 100, and for all the other datasets it is 40. We chose the regularization parameter and the learning rate through cross-validation across the data splits, using grid search in \((\lambda , \eta ) \in \{ 10^{-8}, \dots , 1 \} \times \{ 10^{-3}, \dots , 1 \}\). The margin \(\gamma \) is fixed to 1. Since the size of the parameter matrix scales in \(O(D^2)\) for SVMML and KISSME, we first reduced the dimension of the original features using PCA, keeping 95\(\%\) of the original variance, and then applied these algorithms. In our tables and figures, WARCA\(-\chi ^2\), WARCA-L, rPCCA\(-\chi ^2\), rPCCA-L, LFDA\(-\chi ^2\) and LFDA-L denote WARCA, rPCCA and LFDA with the \(\chi ^2\) kernel and with the linear kernel respectively.

For all experiments with WARCA we used harmonic weighting for the rank weighting function of Eq. 8. We also tried uniform weighting, which gave poor results compared to the harmonic weighting. For all the datasets we used a mini-batch size of 512 in the SGD algorithm and ran the SGD for 2000 iterations (a parameter update using one mini-batch counts as one iteration).

Tables 2a and b summarize respectively the rank-1 and rank-5 performance of all the methods, and Table 2c summarizes the Area Under the Curve (AUC) performance score. Figure 1 reports the CMC curves comparing WARCA against the baselines on all the nine datasets. The circle and the star markers denote linear and kernel methods respectively.

WARCA improves over all other methods on all the datasets. On the VIPeR, 3DPeS, PRID450s and iLIDS datasets, LFDA comes very close to the performance of WARCA. The reason is that these datasets are small, and consequently simple methods such as LFDA, which exploit strong prior assumptions on the data distribution, work nearly as well as WARCA.

4.3 Comparison Against State-of-the-Art

We also compare against the state-of-the-art results reported with recent algorithms such as MLAPG on LOMO features [9], MLPOLY [25] and IDEEP [26] on the VIPeR, CUHK01 and CUHK03 datasets. The reason for not including these comparisons in the main results is that, apart from MLAPG, the code for the other methods is not available, or the features are different, which makes a fair comparison difficult. Our goal is to evaluate experimentally which is the best off-the-shelf metric learning algorithm for re-identification, given a fixed set of features.

Table 3. Comparison of WARCA against state-of-the-art results for person re-identification

In this set of experiments we used the state-of-the-art LOMO features [4] with WARCA for the VIPeR and CUHK01 datasets. The results are summarized in Table 3. We improve the rank-1 performance by \(21\,\%\) on CUHK03 and by \(1.40\,\%\) on CUHK01.

4.4 Analysis of the AON Regularizer

Here we present an empirical analysis of the AON regularizer against the standard Frobenius norm regularizer. We used the VIPeR dataset with LOMO features for the experiments shown in the first row of Fig. 2. With very low regularization strength, AON and Frobenius behave the same. As the regularization strength increases, Frobenius results in rank-deficient mappings (Fig. 2a), which are less discriminant and perform poorly on the test set (Fig. 2b). The AON regularizer, on the contrary, pushes towards orthonormal mappings, and results in a well-conditioned embedding which generalizes well to the test set. It is also worth noting that training with the AON regularizer is robust over a wide range of the regularization parameter, which is not the case for the Frobenius norm. Finally, the AON regularizer was found to be very robust to the choice of the SGD step size \(\eta \) (Fig. 2c), which is a crucial parameter in large-scale learning. A similar behavior was observed by Lim et al. [10] with their orthonormal Riemannian gradient update step in the SGD, but that step is computationally expensive and not trivial to use with modern SGD algorithms such as Adam [31] and Nesterov’s momentum [35].

Fig. 2. Comparison of the Approximate OrthoNormal (AON) regularizer we use in our algorithm to the standard Frobenius norm (\(L_2\)) regularizer. Graph (a) shows the condition number (ratio between the two extreme eigenvalues of the learned mapping) vs. the weight \(\lambda \) of the regularization term. As expected, the AON regularizer pushes this value to one, as it eventually forces the learning to choose an orthonormal transformation, while the Frobenius regularizer eventually drives the smallest eigenvalues to zero, making the ratio extremely large. Graph (b) shows the rank-1 performance vs. the regularizer weight \(\lambda \), graph (c) the rank-1 performance vs. the SGD step size \(\eta \), graph (d) the CMC curve with the two regularizers, and finally graph (e) shows the rank-1 performance on different datasets

4.5 Analysis of the Training Time

Figure 3 illustrates how the test performance of WARCA and rPCCA increases as a function of training time on 3 datasets. We implemented both algorithms entirely in C++ to allow a fair comparison of running times. In this set of experiments we used 730 test identities for the CUHK03 dataset to have a quick evaluation. Experiments with the other datasets follow the protocol described above. Please note that we do not include spectral methods in this plot because their solutions are found analytically. Linear spectral methods are very fast for low-dimensional problems, but their training time scales quadratically with the data dimension. In the case of kernel spectral methods, the training time scales quadratically with the number of data points. We also do not include the iterative methods MLAPG and SVMML because they proved to be very slow and did not give good performance.

Fig. 3. WARCA performs significantly better than the state-of-the-art rPCCA on large datasets for a given training time budget

5 Conclusion

We have proposed a simple and scalable approach to metric learning that combines a new and simple regularizer with a proxy for a weighted sum of the precision at different ranks. The latter can be used for any weighting of the precision-at-k metrics. Experimental results show that it outperforms state-of-the-art methods on standard person re-identification datasets, and that, contrary to most of the current state-of-the-art methods, it allows for large-scale learning.