Local Feature Discriminant Projection

In this paper, we propose a novel subspace learning algorithm called Local Feature Discriminant Projection (LFDP) for supervised dimensionality reduction of local features. LFDP is able to efficiently seek a subspace to improve the discriminability of local features for classification. We make three novel contributions. First, the proposed LFDP is a general supervised subspace learning algorithm which provides an efficient way for dimensionality reduction of large-scale local feature descriptors. Second, we introduce the Differential Scatter Discriminant Criterion (DSDC) to the subspace learning of local feature descriptors which avoids the matrix singularity problem. Third, we propose a generalized orthogonalization method to impose on projections, leading to a more compact and less redundant subspace. Extensive experimental validation on three benchmark datasets including UIUC-Sports, Scene-15 and MIT Indoor demonstrates that the proposed LFDP outperforms other dimensionality reduction methods and achieves state-of-the-art performance for image classification.


INTRODUCTION
Recently, the use of local features has gained great popularity in computer vision. Based on local feature descriptors, e.g., SIFT [1], the sparse coding algorithm [2], dictionary learning [3], the naive Bayes nearest neighbor (NBNN) classifier [4], and Fisher kernels (FK) [5] have achieved state-of-the-art performance for image classification [6], [7]. Nevertheless, the increasingly large quantity of local feature descriptors severely restricts local feature based algorithms and can make them computationally intractable on large-scale data. Dimensionality reduction algorithms [8], [9], [10], [11], [12] are needed to reduce the computational complexity. However, due to the huge number N (up to 100 M) of local feature descriptors, traditional algorithms [13], [14], e.g., manifold learning using nearest neighbor search (NN-search) with a computational complexity of at least $O(N^2)$, tend to be computationally prohibitive. Efficient algorithms are highly desirable to handle such a huge number of local feature descriptors for dimensionality reduction.
Furthermore, local feature descriptors, e.g., SIFT, are typically constructed in an unsupervised way, which makes them less discriminative and leaves redundant information in them. In contrast, supervised subspace learning [15] can not only reduce the dimensionality of local feature descriptors by removing redundant features but also improve their discriminability for classification. In fact, label information could be used to achieve supervised dimensionality reduction of local feature descriptors, which, however, has not previously been investigated in the literature.
In this paper, we propose a novel, efficient supervised subspace learning algorithm called Local Feature Discriminant Projection (LFDP) for dimensionality reduction of local features. Most dimensionality reduction methods are performed at the image representation level, while this paper focuses on the local feature level. LFDP offers an efficient discriminant analysis which can not only reduce the dimensionality but also enhance the discriminative ability of local features. To achieve supervised local feature reduction, we adopt the image-to-class (I2C) distance [4], [10], [16], which provides an effective measurement of distances between images and classes by incorporating class label information into local features. The discriminant analysis is established by applying the Differential Scatter Discriminant Criterion (DSDC) [17], [18] to the I2C based image representations. The advantage of using the DSDC is that it avoids the matrix singularity problem [19], a shortcoming of LDA, which enables more accurate computation. For efficient computation of I2C distances, we use k-means clustering to restrict the NN-search to the centroids of the local feature clusters in each class, which makes our algorithm computationally efficient without compromising performance.
With the DSDC, we build our objective function to minimize the within-class variance while maximizing the between-class variance. However, solving our objective function is nontrivial due to its quartic form. We use a gradient descent algorithm on a sphere to solve this problem. In addition, an orthogonality constraint is imposed on the projections to make the subspace more compact and less redundant [8]. Unfortunately, existing orthogonalization methods [20], [21] cannot be straightforwardly applied to our scheme since they only orthogonalize the projections of an eigen-decomposition problem, which motivates us to propose a general orthogonalization of the projections via an induction method. The proposed generalized orthogonalization can also be applied to other projection optimization problems. To summarize, the proposed LFDP possesses the following attractive merits:
Unrestricted dimension: Unlike LDA, in which the reduced dimension is restricted by the number of classes, LFDP can project data onto any lower-dimensional space without suffering from the matrix singularity problem.
$O(N)$ complexity: The time complexity of our algorithm is linear in N. In contrast to most manifold learning methods, which need at least $O(N^2)$ time, our algorithm can be practically used for dimensionality reduction on large-scale data.
Generalized orthogonalization: The proposed orthogonalization method is more general and intuitive than previous methods [20], [21], and can also be applied to any other algorithm that needs to compute projection matrices under orthogonality constraints.

RELATED WORK
Principal Component Analysis (PCA) is a popular dimensionality reduction method that can be directly applied to local features; like most unsupervised methods, however, it yields reduced features that are less discriminative than those of supervised methods. Ke and Sukthankar [22] applied PCA to project the gradient image vector of a patch to a more compact descriptor, which is shorter than the standard SIFT descriptor but more robust to image deformations. Existing manifold learning algorithms, e.g., Laplacian Eigenmap (LE) [23], Locally Linear Embedding (LLE) [24] and ISOMAP [25], were proposed to learn the nonlinear structure of the data manifold. These algorithms suffer from the out-of-sample problem [26]. Locality Preserving Projections (LPP) [27] and Neighborhood Preserving Embedding (NPE) [28], the linearized versions of LE and LLE, respectively, were developed to solve the out-of-sample problem. As unsupervised methods, they can be used for both global and local feature reduction.
However, applying them to a large number of local features is computationally infeasible due to their high complexity.Moreover, similar to PCA, their discriminative ability is limited, as class label information is not used.
Linear Discriminant Analysis (LDA) is a conventional supervised method based on the Fisher criterion, which can also be naively employed for local feature reduction by labeling each local feature with the class of the image from which it is extracted. However, the large variability of local features will inevitably mislead the classifier since similar local features could be shared by images from different classes. Discriminative local descriptor learning has been explored independently in [8] and [9], both of which use the same covariance matrices of pair-wise matched and unmatched feature distances to find the linear projection. Recently, Simonyan et al. [29] proposed learning local feature descriptors using convex optimization. However, class labels of images are not used in the learning process, so the projections lose their connection with classification and are therefore suboptimal. These discriminative methods [8], [29] need a huge amount of ground truth in the form of matched/unmatched pairs of local feature descriptors for training, which is not available in a realistic setting. Zhen et al. [10] proposed a supervised algorithm named I2C Distance Discriminative Embedding (I2CDDE) for dimensionality reduction of local features, which is specifically designed for the NBNN classifier and is also computationally expensive. Furthermore, these dimension reduction methods have at least $O(N^2)$ computational complexity, which severely limits their application to large-scale data.

LOCAL FEATURE DISCRIMINANT PROJECTION
In this section, we introduce our Local Feature Discriminant Projection algorithm, before which the I2C distance is reviewed. With image representations based on I2C distances, we build our objective function by incorporating the DSDC for discriminant analysis of local features. To solve the objective function, we present a gradient descent optimization algorithm with a novel, generalized orthogonalization procedure.

Notations
We are given $n$ images $X_1, \ldots, X_n$ from $C$ classes. The $c$-th class contains $n_c$ samples, $c = 1, \ldots, C$. Each image $X_i$ is represented by a set of local feature descriptors $\{x_{i1}, \ldots, x_{im_i}\}$, where $x_{ij} \in \mathbb{R}^D$ is the $j$-th local feature of the $i$-th image, $j = 1, \ldots, m_i$, $i = 1, \ldots, n$. We denote by $N = \sum_{i=1}^{n} m_i$ the total number of local feature descriptors from the training images.

Image-to-Class Distance
The I2C distance introduced in the naive Bayes nearest neighbor classifier [4] represents the average of the squared distances from the local feature descriptors of an image to their corresponding nearest neighbors in each class. To be specific, the I2C distance from image $X_i$ to class $c$ is defined as

$$d_i^c = \frac{1}{m_i} \sum_{j=1}^{m_i} \| x_{ij} - x_{ij}^c \|^2,$$

where $x_{ij}^c$ is the nearest neighbor of $x_{ij}$ in class $c$ and $\|\cdot\|$ is the $L_2$ norm. However, in our scheme, to reduce the complexity of NN-search in the computation of I2C distances, we first employ the K-means clustering algorithm on the set of local feature descriptors of each class [30], [31], i.e., $\bigcup_{X_i \in \text{class } c} X_i$, $c = 1, \ldots, C$. The search range is now reduced to the cluster centers, i.e., we let $x_{ij}^c \in \text{Centroids of } \bigcup_{X_i \in \text{class } c} X_i$ for each $c$. The I2C distance is a non-parametric approximation of the log-likelihood $\log p(X_i \mid c) = \log \prod_{j=1}^{m_i} p(x_{ij} \mid c)$ [4]. When using Gaussian kernel density estimation, we have

$$\hat{p}(x \mid c) = \frac{1}{L_c} \sum_{l=1}^{L_c} K\big(x - x_l^{(c)}\big), \qquad K(z) \propto \exp\Big(-\frac{\|z\|^2}{2\sigma^2}\Big),$$

where $x$ represents an arbitrary local feature descriptor, $x_1^{(c)}, \ldots, x_{L_c}^{(c)}$ are the local features extracted from all the images in class $c$, and $\sigma$ denotes the kernel bandwidth. Note that with fixed centers, diagonal covariance matrices and equal weights, the density estimation turns out to be a special case of the Gaussian mixture models (GMM) used in a state-of-the-art image representation called Fisher vectors [5], [32]. If we choose the centers, covariance matrices and weights of the GMM as, for instance, all of the training local features $\{x_1, \cdots, x_N\}$, diagonal matrices and equal weights, respectively, we have

$$p(x) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}(x;\, x_i, \Sigma_i), \qquad \Sigma_i \text{ diagonal}.$$

In this case, if the number of local features in each class ($L_c$) is the same, the log-likelihood of the GMM is positively related to the "average" of all the I2C distances, and its gradients with respect to the parameters construct a Fisher vector.
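As an illustration of the centroid-based I2C distance described above, the following Python sketch clusters the descriptors of each class and then measures, for one image, the mean squared distance from each descriptor to its nearest class centroid. It is a minimal sketch rather than the authors' implementation: the function names (`kmeans_centroids`, `i2c_distance`, `i2c_vector`), the plain Lloyd iterations and the random initialization are assumptions made for the example.

```python
import numpy as np

def kmeans_centroids(X, K, n_iter=20, seed=0):
    """Plain Lloyd iterations (assumed variant); returns a (K, D) array of centers."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # squared distance from every descriptor to every centroid: shape (n, K)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:          # keep the old center if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids

def i2c_distance(image_descs, class_centroids):
    """Mean squared distance from each descriptor to its nearest class centroid."""
    d2 = ((image_descs[:, None, :] - class_centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

def i2c_vector(image_descs, centroids_per_class):
    """Stack the I2C distances to all C classes into one vector of length C."""
    return np.array([i2c_distance(image_descs, c) for c in centroids_per_class])
```

With `centroids_per_class` computed once per class, each descriptor is compared with only K centroids per class instead of with every training descriptor, which is where the linear dependence on N comes from.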
Based on I2C distances, we propose local feature discriminant projection by applying a discriminant analysis to local features for supervised dimensionality reduction. It is worthwhile to highlight that our LFDP is not restricted to the I2C distance. Other measurements, e.g., the Kullback-Leibler divergence, the Hausdorff distance and the Bhattacharyya distance, could also be used to measure the relationship between images and classes. More importantly, our LFDP is a general supervised algorithm for dimension reduction which can be applied to any local feature descriptors, including not only the handcrafted SIFT used in this paper but also recent deep learning based representations [33], [34].
In addition, local features reduced by our LFDP can be fed to different existing representation methods, e.g., the bag-of-words model, sparse coding, NBNN and Fisher kernels. We use the Fisher kernels for the final image representations in order to achieve state-of-the-art performance.

Discriminant Analysis
Our goal is to seek a matrix $w \in \mathbb{R}^{D \times d}$ to project the original local features $x_{ij}$ of dimension $D$ to $w^T x_{ij}$ in a lower-dimensional but more discriminative space $\mathbb{R}^d$. Note that after applying the projection matrix $w$, the nearest neighbors may change. However, for the large-scale local feature space, we approximately adopt the sum of the distances from $w^T x_{ij}$ to the projected nearest neighbor $w^T x_{ij}^c$. Without loss of generality, we first consider the case in which $w$ is a column vector, i.e., $d = 1$. In fact, we will compute the column vectors of the projection matrix one by one. In this case, the projected I2C distances of an image will be

$$d_i = \big(w^T \Delta X_{i1}^T \Delta X_{i1} w, \; \ldots, \; w^T \Delta X_{iC}^T \Delta X_{iC} w\big)^T,$$

which is called an I2C vector; here $\Delta X_{ic}$ collects the differences between the local features of image $X_i$ and their nearest neighbors in class $c$. In other words, for each image $X_i$, we have a corresponding vector $d_i$ in the linear space $\mathbb{R}^C$, which is called the I2C vector space. Then we have the mean of the vectors in class $c$ and the mean of all the vectors, denoted by $m_c$ and $m$, respectively. Having the representations with I2C vectors, we incorporate the Differential Scatter Discriminant Criterion in the I2C vector space to obtain our objective function in the following form that needs to be maximized: [...] (2), where $\lambda$ is a tuning parameter. $m_c$ and $m$ are computed by

$$m_c = \frac{1}{n_c} \sum_{X_i \in \text{class } c} d_i, \qquad m = \frac{1}{n} \sum_{i=1}^{n} d_i.$$

Now we can formulate $J$ as a function of $w$: [...] (3).
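To make the construction above concrete, the sketch below computes projected I2C vectors for a single projection vector and then scores them with a differential scatter criterion. The exact objective of the paper is not recoverable from the text here, so the code assumes the common DSDC form, between-class scatter minus $\lambda$ times within-class scatter, with class-size weighting; that form, the $1/m_i$ averaging and the names `projected_i2c_vectors` and `dsdc_objective` are assumptions for illustration only.

```python
import numpy as np

def projected_i2c_vectors(w, diffs):
    """diffs[i][c] is an (m_i, D) array whose rows are x_ij - x_ij^c.

    Returns an (n, C) array; entry (i, c) is the mean of (w^T (x_ij - x_ij^c))^2.
    """
    return np.array([[np.mean((dX @ w) ** 2) for dX in per_class]
                     for per_class in diffs])

def dsdc_objective(d, labels, lam):
    """Assumed DSDC form: tr(S_b) - lam * tr(S_w) over the I2C vectors d (n, C)."""
    m = d.mean(axis=0)                        # mean of all I2C vectors
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        d_c = d[labels == c]
        m_c = d_c.mean(axis=0)                # mean I2C vector of class c
        between += len(d_c) * np.sum((m_c - m) ** 2)
        within += np.sum((d_c - m_c) ** 2)
    return between - lam * within
```

Each entry of an I2C vector is quadratic in `w`, and the scatter terms are quadratic in the I2C vectors, so the resulting objective is quartic in `w`; scaling `w` by 2, for example, scales every I2C vector by 4.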

Gradient Descent on Sphere
The classic eigen-decomposition of a matrix is not applicable to our problem due to the quartic form of the objective function. We adopt a procedure of gradient descent on a sphere to find the projection vector. Our goal is to find the optimal $w$ by maximizing $J(w)$. To obtain the final orthonormal projection matrix, we set a norm constraint $\|w\| = 1$ for each vector. However, the update rule of the traditional gradient descent for a maximization problem, $w^{(t+1)} = w^{(t)} + \gamma \nabla J(w^{(t)})$, does not guarantee this constraint. Thus we amend the original algorithm and constrain it to the $D$-dimensional unit sphere.
Define two matrix-valued functions $M(w)$ and $V(w)$: [...] (4) and [...] (5). We obtain the gradient of $J(w)$: [...]. We project $\nabla J(w)$ onto the tangent direction of $w$ on the sphere as $p = \nabla J(w) - \langle \nabla J(w), w \rangle w$ and normalize it as $p' = p / \|p\|$. By the first-order Taylor expansion, $\nabla J(w)$ is the direction of steepest increase. For the direction $p$, we have $\langle p, \nabla J(w) \rangle = \langle \nabla J(w), \nabla J(w) \rangle - \langle \nabla J(w), w \rangle^2 = \|\nabla J(w)\|^2 - \|\nabla J(w)\|^2 \cos^2 \alpha \ge 0$, where $\alpha$ is the angle between $\nabla J(w)$ and $w$. Thus $p$ is still an increasing direction. Then for the $t$-th step, we have the following update rule with step size $\theta_t$: $w^{(t+1)} = w^{(t)} \cos\theta_t + p'^{(t)} \sin\theta_t$.
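The tangent projection and the rotation-style update can be sketched in a few lines of Python for any differentiable objective. The function below takes the objective and its gradient as callables, so it does not depend on the specific form of $J$; the name `sphere_ascent` and its default arguments are hypothetical. The step size is doubled (capped at $\pi/2$) after an improving step and halved otherwise, as in the adaptive rule of this paper. One assumption goes beyond the text: a step that lowers $J$ is discarded, which keeps the objective monotone in this sketch.

```python
import numpy as np

def sphere_ascent(J, grad_J, w0, theta0=0.5, n_iter=100):
    """Maximize J over the unit sphere by rotating w toward the tangent gradient."""
    w = w0 / np.linalg.norm(w0)
    theta = theta0
    for _ in range(n_iter):
        g = grad_J(w)
        p = g - np.dot(g, w) * w              # component of the gradient tangent to the sphere
        norm_p = np.linalg.norm(p)
        if norm_p < 1e-12:                    # stationary point on the sphere
            break
        p_unit = p / norm_p
        # w and p_unit are orthonormal, so the rotated vector keeps unit length
        w_new = w * np.cos(theta) + p_unit * np.sin(theta)
        if J(w_new) >= J(w):
            w = w_new                         # accept and try a bolder step next time
            theta = min(2.0 * theta, np.pi / 2)
        else:
            theta = theta / 2.0               # assumption: reject and shrink the step
    return w
```

For a quick check, one can maximize the quadratic form $w^T A w$ with `A = np.diag([3.0, 2.0, 1.0])`, for which the maximizer on the sphere is the first coordinate axis up to sign.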

Orthogonality Constraints
Until now we have only computed the projection vector for the first dimension. In this section, we use induction to compute the remaining vectors successively and make them mutually orthogonal by using the matrix composed of the previous output vectors. Previous works [8], [20] have highlighted the benefits of orthogonality constraints, for instance, avoidance of overfitting and of redundancy in representing the subspace. With this orthogonalization procedure, we can establish our whole algorithm.
Suppose we have obtained the first $p$ ($p \ge 1$) discriminant vectors $w_1, w_2, \ldots, w_p$. We want to compute the next vector $w_{p+1}$ to maximize $J(w)$ subject to the orthogonality constraints and an additional norm constraint on $w_{p+1}$, i.e., $\|w_{p+1}\| = 1$. The method in [20] cannot be applied in our scheme due to the high degree of the Lagrangian in our case. We use an alternative but more general method based on a basis transformation to solve this issue. In other words, we compute the next discriminant vector in a special subspace in which the orthogonality constraints vanish.
According to the inductive assumption, the vectors $w_1, w_2, \ldots, w_p$ form an orthonormal basis of a subspace of $\mathbb{R}^D$. Let us denote $V_p = \mathrm{span}(w_1, w_2, \ldots, w_p)$ and $W_p = [w_1, w_2, \ldots, w_p]$. Then $V_p$ is a $p$-dimensional subspace and $W_p$ is a $D \times p$ matrix. Recall that our primary goal is to seek an optimal $w$ by maximizing $J(w)$. Once we have obtained the subspace $V_p$, $w_{p+1}$ is required to be orthogonal to all the vectors in $V_p$. Consequently, we need to solve the constrained optimization problem

$$\arg\max_{w \in V_p^\perp} J(w) \quad (10)$$

to find $w_{p+1}$, where $V_p^\perp$ is the orthogonal complement of $V_p$ (the null space of $W_p^T$) and $\dim V_p^\perp = D - p$. Straightforwardly, the data can be projected onto the subspace $V_p^\perp$ so that the computation is performed entirely in a $(D-p)$-dimensional linear subspace, i.e., the new coordinates are in $\mathbb{R}^{D-p}$. Then the output will be orthogonal to every vector in $V_p$. For this reason, we need to find a basis $B_p = [b_1, \ldots, b_{D-p}] \in \mathbb{R}^{D \times (D-p)}$ of $V_p^\perp$. In fact, we need only solve the linear equation $W_p^T X = 0$, a standard problem in linear algebra. Furthermore, we make this basis orthonormal by following the Gram-Schmidt procedure. Now with this orthonormal basis $B_p$, we project the data from $\mathbb{R}^D$ onto the subspace $V_p^\perp$. Specifically, for a local feature and an I2C vector, we have the transformations $x_{ij} \leftarrow B_p^T x_{ij}$ and $d_i \leftarrow (v^T B_p^T \Delta X_{i1}^T \Delta X_{i1} B_p v, \ldots, v^T B_p^T \Delta X_{iC}^T \Delta X_{iC} B_p v)$, respectively, where $v$ is a vector in $\mathbb{R}^{D-p}$. Then we only need to solve the unconstrained problem in a lower-dimensional space:

$$\arg\max_{v \in \mathbb{R}^{D-p}} J(B_p v), \quad (11)$$

which is evaluated with $M_p(\cdot)$ and $V_p(\cdot)$, the images of the matrix-valued functions $M(\cdot)$ and $V(\cdot)$ after the projection, respectively, i.e., $\Delta M_{cj} \leftarrow B_p^T \Delta M_{cj} B_p$ and $V_{kj}^c \leftarrow B_p^T V_{kj}^c B_p$. Now it is an optimization problem in which the orthogonality constraints vanish, and we return to our first goal in the $(D-p)$-dimensional space.
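The basis transformation can be illustrated with a short Python sketch. Instead of solving $W_p^T X = 0$ and running Gram-Schmidt as in the text, the sketch takes the trailing left singular vectors of $W_p$ from a full SVD, which yields an orthonormal basis of the same subspace; this substitution, the assumption that $W_p$ has full column rank, and the names `orthonormal_complement` and `to_complement_coords` are choices made for the example.

```python
import numpy as np

def orthonormal_complement(W):
    """Return B of shape (D, D - p) with orthonormal columns and W^T B = 0.

    Assumes the D x p matrix W has full column rank p >= 1.
    """
    D, p = W.shape
    # Full SVD: the first p left singular vectors span range(W),
    # the remaining D - p span its orthogonal complement.
    U, _, _ = np.linalg.svd(W, full_matrices=True)
    return U[:, p:]

def to_complement_coords(X, B):
    """Map descriptors stored as rows of X to coordinates B^T x in R^(D - p)."""
    return X @ B
```

Any unit vector `v` found in the reduced coordinates maps back to `w = B @ v`, which has unit norm and is orthogonal to every column of `W` by construction, so no explicit orthogonality constraint is needed during the optimization.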

Algorithm 2. Local Feature Discriminant Projection
Input: The input of Algorithm 1 and the target dimension $d$.
Output: The projection matrix $w$.
Initialization: $w \leftarrow \emptyset$ and $B \leftarrow I$;
for $i = 1, \ldots, d$ do
  Project the training data onto $\mathrm{span}(w)^\perp$ by using the basis $B$;
  Call Algorithm 1 to compute the corresponding projection vector $w_i$ in the subspace $\mathrm{span}(w)^\perp$ and update $w_i \leftarrow B w_i$;
  Update $w \leftarrow [w, w_i]$ and let $B$ be an orthonormal basis of $\mathrm{span}(w)^\perp$ by solving the corresponding linear equation and following the Gram-Schmidt procedure.
end for

Having the solution $v^*$ of the optimization problem (11), we set $w_{p+1} = B_p v^* \in V_p^\perp$.

Remark. The proposed orthogonalization procedure is a more general way to compute orthogonal projection matrices. Note that, in Algorithm 2, given the input of Algorithm 1, we only need Algorithm 1 to output a projection vector; we do not need to know how it is computed. Therefore, Algorithm 1 can be seen as a black box that is able to compute the projection vector (for algorithms that output a matrix, we only need its first column). Now we have the following general proposition.
Proposition. Given a maximizing (minimizing) algorithm $A$ which takes the projected data $w^T x$ as input and outputs the optimal vector, and an orthonormal basis $B_p$ of the $(D-p)$-dimensional subspace $V_p^\perp$, $B_p \cdot A(v^T B_p^T x)$ is the optimal solution of $A(w^T x)$ in $V_p^\perp$.
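The black-box view of Algorithm 2 can be sketched as a loop around an arbitrary single-vector solver. In the Python sketch below, `solve_one` stands in for Algorithm 1 and is a hypothetical callable that receives the data in the reduced coordinates and returns one unit vector; `top_variance_direction` is only a stand-in solver used to exercise the loop, not the LFDP objective. As a further assumption, the complement basis is obtained from an SVD rather than from a linear solve followed by Gram-Schmidt.

```python
import numpy as np

def complement_basis(W, D):
    """Orthonormal basis of the orthogonal complement of span(W) in R^D."""
    if W.shape[1] == 0:
        return np.eye(D)                      # nothing found yet: whole space
    U, _, _ = np.linalg.svd(W, full_matrices=True)
    return U[:, W.shape[1]:]

def orthogonal_projections(solve_one, X, d):
    """Compute d mutually orthonormal projection vectors with a black-box solver."""
    D = X.shape[1]
    W = np.zeros((D, 0))
    for _ in range(d):
        B = complement_basis(W, D)
        v = solve_one(X @ B)                  # solve in R^(D - p); rows are B^T x
        w = B @ v                             # map the solution back to R^D
        W = np.column_stack([W, w / np.linalg.norm(w)])
    return W

def top_variance_direction(Z):
    """Stand-in solver: unit direction of largest variance of the rows of Z."""
    Zc = Z - Z.mean(axis=0)
    _, vecs = np.linalg.eigh(Zc.T @ Zc)       # eigenvalues in ascending order
    return vecs[:, -1]
```

Calling `orthogonal_projections(top_variance_direction, X, 3)` on an `(n, D)` data matrix returns a `D x 3` matrix with orthonormal columns; swapping in a different `solve_one` changes the criterion without touching the orthogonalization logic.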

Relations between Algorithm 2 and the Ordinary Eigen-Decomposition
In fact, assuming that the optimization problem reduces to the eigen-decomposition of a symmetric matrix $A \in \mathbb{R}^{D \times D}$, as in PCA, we prove by mathematical induction that the proposed orthogonalization method finds the same eigenvectors as the eigen-decomposition. Suppose $A = \sum_{i=1}^{D} \lambda_i w_i w_i^T = W \Lambda W^T$ is the spectral decomposition of $A$ with $\lambda_1 \ge \ldots \ge \lambda_D$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_D)$ and $W = [w_1, \ldots, w_D]$. Then $w_i^T w_j = 0$ if $i \ne j$ and $w_i^T w_i = 1$ for $i = 1, \ldots, D$. For the first vector, both Algorithm 2 and the eigen-decomposition output the eigenvector $w_1$ corresponding to the largest eigenvalue of $A$. Assume Algorithm 2 has output the first $k$ eigenvectors $w_1, \ldots, w_k$. For the $(k+1)$-th vector, $w_{k+1}$ is the eigenvector corresponding to the eigenvalue $\lambda_{k+1}$. Algorithm 2 first finds an orthonormal basis $B \in \mathbb{R}^{D \times (D-k)}$ of $\mathrm{span}(w_1, \ldots, w_k)^\perp$. Since $W$ is an orthogonal matrix, we have $\mathrm{span}(w_1, \ldots, w_k)^\perp = \mathrm{span}(w_{k+1}, \ldots, w_D)$. Then there exists an orthogonal matrix $P \in \mathbb{R}^{(D-k) \times (D-k)}$ such that $B = W_{k+1} P$, where $W_{k+1} = [w_{k+1}, \ldots, w_D]$. In the $(k+1)$-th step of Algorithm 2, we eigen-decompose the matrix $B^T A B$ to compute its largest eigenvalue. Note that

$$B^T A B = P^T W_{k+1}^T A W_{k+1} P = P^T \Lambda_{k+1} P,$$

where $\Lambda_{k+1} = \mathrm{diag}(\lambda_{k+1}, \ldots, \lambda_D)$. Therefore, the largest eigenvalue of $B^T A B$ is still $\lambda_{k+1}$, which indicates that the eigenvalues corresponding to the output vectors of Algorithm 2 are $\lambda_1, \ldots, \lambda_D$. Then the whole output set of Algorithm 2 is $\{w_1, \ldots, w_D\}$ up to sign.
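The induction step can also be checked numerically. The Python sketch below builds a random symmetric matrix, forms an arbitrary orthonormal basis $B = W_{k+1} P$ of the complement of the first $k$ eigenvectors, and compares the spectrum of $B^T A B$ with the remaining eigenvalues of $A$. The function name `deflation_check`, the matrix size and the random seed are assumptions made for the example.

```python
import numpy as np

def deflation_check(D=6, k=2, seed=0):
    """Compare eig(B^T A B) with the trailing eigenvalues of a symmetric A."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((D, D))
    A = S + S.T                                   # random symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(A)          # ascending order
    lam = eigvals[::-1]                           # lambda_1 >= ... >= lambda_D
    W = eigvecs[:, ::-1]                          # matching eigenvectors as columns
    # Any orthonormal basis of span(w_{k+1}, ..., w_D) equals W_{k+1} P
    # for some orthogonal P; draw a random P from a QR factorization.
    P, _ = np.linalg.qr(rng.standard_normal((D - k, D - k)))
    B = W[:, k:] @ P
    reduced_vals, reduced_vecs = np.linalg.eigh(B.T @ A @ B)
    top = B @ reduced_vecs[:, -1]                 # top reduced eigenvector in R^D
    alignment = abs(float(top @ W[:, k]))         # |cosine| with w_{k+1}
    return lam, reduced_vals[::-1], alignment
```

The second return value should match `lam[k:]`, and the alignment should be close to 1 whenever $\lambda_{k+1}$ is a simple eigenvalue, which mirrors the "up to sign" statement of the proof.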

Complexity Analysis
Our LFDP is computationally more efficient than most existing manifold learning methods. We provide a complexity analysis of the two procedures of our LFDP, gradient descent and orthogonalization, in terms of time complexity and memory cost, since in the test phase the complexity depends on the classifier and the time complexity will clearly be reduced after dimensionality reduction.
Gradient descent. During the iterative procedure of gradient descent, the main cost is induced by the computation of the I2C distances. The time complexity of a brute-force NN-search among $K$ centroids of dimension $D$ is $O(KND)$. Computing $M(w)$ and $V(w)$ needs $O(D^2 C^2)$ and $O(D^2 C n)$ time, respectively, where $n$ is the number of training images. Then the time complexity of the gradient descent with $N_{iter}$ steps in a $D$-dimensional space is $O(N_{iter}(D^2 C^2 + D^2 C n))$ and the time complexity of the whole procedure [...]
Orthogonalization. We can observe that the main step in the orthogonalization procedure is the Gram-Schmidt procedure, which requires at most $O(nm^2)$ time and $O(nm + m^2)$ memory for computing on $m$ $n$-dimensional vectors [35]. Notice that, in our algorithm, $m$ varies from 1 to $d$ and $n$ varies from [...], where $d$ is the dimension of the projected space.
In total, with the complexity $O(TKND)$ of the K-means, where $T$ is the number of K-means iterations, our LFDP algorithm [...] Due to the large number of local feature descriptors, generally $N \gg D$, we show the computational complexity in $N$ by comparing our algorithm with other dimensionality reduction methods in Table 1, where $K$ is the parameter of K-means and $k$ is the parameter of the k-nearest neighbor (KNN) algorithm. In fact, KNN-based algorithms rely heavily on the neighborhood structure of each point, which will be changed by K-means clustering. In addition, K-means may also change the order of I2C distances when there are similar classes or noisy data points, and therefore mislead the learning of I2CDDE, leading to the failure of NBNN. In contrast, our discriminant analysis considers the relationships of intra-class and inter-class variations among I2C vectors, achieving a global optimization objective. Therefore, using K-means centroids not only makes our LFDP computationally more efficient but also makes it tolerant to the fluctuation of I2C distances.

EXPERIMENTS
We have extensively validated our LFDP algorithm on three widely used benchmark datasets, i.e., UIUC-Sports, Scene-15 and MIT Indoor. Experimental results show that our LFDP largely outperforms representative dimension reduction algorithms and achieves state-of-the-art performance.

Implementation Details
The optimal tuning parameter $\lambda$ for each dataset is selected from $\{0.1, 0.2, \ldots, 1\}$ as the value which yields the best performance by 10-fold cross-validation on the training data. We fix $K = 300$ in K-means for all datasets and set the maximum number of K-means iterations to 20. In addition, the K-means clustering for each class can be performed in parallel to save time. We take the Improved Fisher Kernel (IFK), which is an improved version of Fisher kernels [36], based on raw SIFT descriptors without dimension reduction as the baseline. We compare with PCA as a representative unsupervised algorithm, which has shown competitive and even better performance than manifold learning algorithms including ISOMAP, LLE and LE on diverse tasks [37]. LDA is included for comparison as a supervised algorithm. The parameter $k$ of the KNN algorithm in LPP and NPE is tuned by selecting from $\{5, 6, \ldots, 15\}$. Following the setting in [9], we randomly select $1.5 \times 10^5$ local features from all the training sets for training the projection of LDP. ISOMAP is not involved in the comparison due to the out-of-sample problem. All the experiments are implemented using Matlab 2013b on a workstation configured with an i7 processor and 32 GB RAM.

Datasets
UIUC-Sports. The Sports event dataset was introduced in [38] and consists of eight sports event categories. The number of images in each class ranges from 137 to 250. We follow the experimental setting in [38] and randomly select 70 and 60 images per class for training and testing, respectively. The procedure is repeated five times and the average is reported as the final result. In contrast to that setting, we use the original images rather than the resized ones.
Scene-15. The Scene-15 dataset [39] consists of 4,485 images which are labeled in 15 distinct classes. The number of images in each class varies from 200 to 400. Following the experimental setting in [39], we randomly select 100 images in each class as training data and test on the remaining images. The procedure is repeated five times and the average is reported as the final result.
MIT Indoor. The MIT Indoor scene dataset [40] contains 67 indoor scene categories for a total of 15,620 images. The number of images in each class varies from 100 to 734. Following the experimental setting in [40], 80 and 20 images are selected in each category for training and testing, respectively, and the average is reported.

Local Feature and Classifier
We use the software provided by Yang et al. [41] to compute the SIFT descriptors. In contrast to existing works which use multi-scale SIFT descriptors [42], spatial pyramid representation [43] or multiple descriptors [4], [42], we simply use single-scale SIFT descriptors in patches of 16 × 16. In our experiments, the average number of local features extracted from each image is 1,500 in each of the three datasets. Then the total numbers ($N$) of training local features in the above three datasets are 900,000, 2,000,000 and 8,000,000, respectively.
We employ a linear SVM classifier with IFK [36] and compute the Fisher vector for each image based on its local features by following the settings in [36] using 256 Gaussians in the GMM.
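The effect of descriptor dimensionality on the length of the final representation can be estimated with simple arithmetic. The sketch below assumes that the Fisher vector stacks the gradients with respect to the GMM means and variances only, giving $2KD$ entries for $K$ Gaussians and $D$-dimensional descriptors; this layout is the usual one for the improved Fisher kernel, but the text does not spell out its exact encoding, and the helper name `fisher_vector_dim` is hypothetical.

```python
def fisher_vector_dim(n_gaussians, desc_dim):
    """Length of a Fisher vector built from mean and variance gradients (assumed layout)."""
    return 2 * n_gaussians * desc_dim

# 256 Gaussians, as in the experimental setting; 128 is the raw SIFT dimension
for dim in (128, 40, 30):
    print(dim, fisher_vector_dim(256, dim))
```

Under this assumption, raw 128-dimensional SIFT gives a 65,536-dimensional Fisher vector, while descriptors reduced to 40 or 30 dimensions give 20,480 and 15,360 dimensions, respectively, i.e., roughly three to four times shorter.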

Resource Requirements
In Table 2, we list the resource requirements for training the projections by different dimensionality reduction methods. The nearest neighbor search and the computation of pairwise distances make $O(N^2)$ methods suffer from high computational complexity. Note that the runtime for LPP and NPE is a theoretical value since it is infeasible to run them with such large memory requirements. Therefore, to use the largest possible number of features that can be handled by our workstation, a subset consisting of $1.5 \times 10^5$ local features is randomly selected from the whole training set for evaluating these methods.

Results
The performance comparison of LFDP with other dimensionality reduction methods is shown in Figs. 1a, 1b and 1c for UIUC-Sports, Scene-15 and MIT Indoor, respectively. The baseline represents the performance of SVMs with IFK in the original 128-dimensional SIFT space without dimensionality reduction. The proposed method shows consistent advantages on all three datasets and improves on the baseline by a large margin. PCA usually reaches its highest accuracy around a dimension of 50 and remains stable as the dimensionality increases. Other methods such as LPP, NPE, LDP and I2CDDE only slightly outperform PCA. In contrast with the above methods, the accuracy of LFDP rises rapidly with the dimension when the dimension is low and reaches competitive results around a dimension of 40 (even at 30). With the local feature descriptors reduced by LFDP, the dimensionality of the Fisher vectors is several times smaller than the original dimension, which reduces the computational cost of classification while strengthening the discriminative ability thanks to the learning. Furthermore, the advantage of our method is also shown by comparing with LDA. Note that LDA learns the projection matrix by directly labeling the local features with the class labels of the images they belong to. Since the performance of LDA is also restricted by the number of classes [44], the upper bound of the reduced dimensionality of LDA is $C - 1$, at which LDA reaches its best performance. We report the best results of PCA and LDA on different datasets for comparison with the results of LFDP in Table 3. LDA with the Fisher criterion produces results below the baseline on the UIUC-Sports dataset since it contains only eight classes, so that the result is obtained with seven-dimensional local descriptors. To alleviate the dimension restriction of LDA with the Fisher criterion, we implement LDA with the DSDC using a parameter $\lambda$ similar to that in Eq. (2). We tune $\lambda$ in $\{0.1, 0.2, \cdots, 1\}$ and the best results are reported in Table 3. With the DSDC, the reduced dimension of LDA is not restricted by the number of classes and the results are significantly improved.
LFDP can efficiently find a lower-dimensional but more discriminative feature space and achieves state-of-the-art results [42], [45], [46], which reveals its capability for dimensionality reduction of ubiquitous local feature spaces at large scale.

Algorithm Analysis
We also evaluate the performance of Algorithm 1 in terms of convergence. We randomly initialize $w$ 50 times on the UIUC-Sports dataset, and the average value of the objective function in Eq. (3) and the average difference $\|w^{(t)} - w^{(t-1)}\|$ on the first dimension are reported in Fig. 2, where $t$ is the iteration number and $\lambda$ is fixed at 0.1. We can observe that $w$ converges within only 10 steps. Therefore, we always fix the maximum number of iterations at 10 in the experiments.
In addition, LFDP achieves the best performance with a small value of $K$ in K-means, which guarantees the computational efficiency. We have investigated the performance under different values of the parameter $K$, as shown in Table 4. On all three datasets, our method yields the best results with $K = 300$, which is much smaller than the number of local feature descriptors in each class (up to 120,000). This largely reduces the computational complexity.

CONCLUSION
A new subspace learning algorithm called Local Feature Discriminant Projection has been proposed for supervised dimensionality reduction of local features. The projections for reduction are obtained by optimizing an objective function constructed based on the Differential Scatter Discriminant Criterion and the I2C representations. A general orthogonalization method has been proposed to learn the projections, which guarantees a more compact space with less redundancy. The proposed LFDP has a much lower complexity than popular manifold learning methods, providing an alternative way to efficiently analyze large-scale data. The experimental results on three widely used benchmarks for image classification have validated the effectiveness of LFDP and shown its advantages over traditional dimensionality reduction algorithms. In future work, we aim to extend our algorithm to the semi-supervised and unsupervised settings for more practical applications.

Fig. 1 .Fig. 2 .
Fig. 1. Performance (percent) of linear SVMs with IFK in different lower-dimensional subspaces on the UIUC-Sports, Scene-15 and MIT Indoor datasets. Note that we only use one type of local descriptor: SIFT in single-scale patches.
In the update rule of the gradient descent on the sphere, $\theta_t \in (0, \pi/2)$ is the step size. Since $w$ and $p'$ are orthogonal, the updated variable remains of unit length. In addition, to accelerate the convergence, we also employ an adaptive step size $\theta_t$, i.e., if $J(w^{(t+1)}) \ge J(w^{(t)})$, we set $\theta_{t+1} = \min(2\theta_t, \pi/2)$; otherwise, $\theta_{t+1} = \theta_t / 2$. The iterative procedure is described in Algorithm 1.

Algorithm 1.
Input: The local feature descriptors $\{x_{ij}\}$ of each image and the parameter $K$ in K-means.
Output: The projection vector $w$ in the first dimension.
Employ the K-means algorithm on the local feature set of each class;
Find the nearest neighbor $x_{ij}^c$ of $\{x_{ij}\}$ among the centroids of each class;
Compute the matrix-valued functions $M(w)$ and $V(w)$ in Eqs. (4) and (5);
Initialize the step size $\theta_1 \in (0, \pi/2)$ and randomly initialize the unit vector $w^{(1)}$;
repeat
  Compute the projection of $\nabla J(w^{(t)})$ on the tangent direction of $w^{(t)}$: $p^{(t)} = \nabla J(w^{(t)}) - \langle \nabla J(w^{(t)}), w^{(t)} \rangle w^{(t)}$ and apply the normalization $p'^{(t)} = p^{(t)} / \|p^{(t)}\|$;
  Compute $w^{(t+1)} = w^{(t)} \cos\theta_t + p'^{(t)} \sin\theta_t$;
  [...]

With the solution $v^*$ in $\mathbb{R}^{D-p}$, we transform it to an element in $V_p^\perp \subset \mathbb{R}^D$. Actually, $\mathbb{R}^{D-p}$ and $V_p^\perp$ are two isomorphic linear spaces and $B_p$ can be regarded as a linear isomorphism between them. Through the representation in an orthonormal basis, for each $w \in V_p^\perp$, we have $w = \sum_{i=1}^{D-p} w_i b_i$, where $w_i \in \mathbb{R}$, and the inner product of $w$ and $b_i$ will be $\langle w, b_i \rangle = w_i$, $\forall i$. Then $(w_1, \ldots, w_{D-p})^T = (\langle w, b_1 \rangle, \ldots, \langle w, b_{D-p} \rangle)^T = [b_1, \ldots, b_{D-p}]^T w = B_p^T w$, i.e., the result of multiplying $w$ on the left by $B_p^T$ is the coefficient vector of its representation in $B_p$. Finally, we set $w_{p+1} = B_p \cdot v^* \in V_p^\perp$ as a linear combination of the columns of $B_p$. The whole LFDP algorithm is illustrated in Algorithm 2.

TABLE 1
Comparing the Complexity of LFDP with Other Linear Algorithms on N Where K Is the Parameter of K-Means and k Is the Parameter of the KNN Algorithm

TABLE 2
Resource Requirements of Different Methods for the 900,000 SIFT Features from the UIUC-Sports Dataset

TABLE 3
Performance (Percent) of Linear SVMs with IFK After PCA, LDA and LFDP Reduction on Local Features
LDA1 is the LDA with the Fisher criterion and LDA2 is the LDA with the DSDC. The results listed in the table are their best accuracies. The baseline is the classification result of IFK without dimensionality reduction of local feature descriptors.