Multiview Discriminative Geometry Preserving Projection for Image Classification

In many image classification applications, it is common to extract multiple visual features from different views to describe an image. Different visual features have their own statistical properties and discriminative powers for image classification, and the conventional solution for multiple view data is to concatenate these feature vectors into a new feature vector. However, this simple concatenation strategy not only ignores the complementary nature of different views, but also suffers from the “curse of dimensionality.” To address this problem, we propose a novel multiview subspace learning algorithm, named multiview discriminative geometry preserving projection (MDGPP), for feature extraction and classification. MDGPP not only preserves the intraclass geometry and interclass discrimination information under a single view, but also explores the complementary property of different views to obtain a low-dimensional optimal consensus embedding by using an alternating-optimization-based iterative algorithm. Experimental results on face recognition and facial expression recognition demonstrate the effectiveness of the proposed algorithm.


Introduction
Many computer vision and pattern recognition applications involve processing data in a high-dimensional space. Directly operating on such high-dimensional data is difficult due to the so-called "curse of dimensionality." For computational time, storage, and classification performance considerations, dimensionality reduction (DR) techniques provide a means to solve this problem by generating a succinct and representative low-dimensional subspace of the original high-dimensional data space. Over the past two decades, many dimensionality reduction algorithms have been proposed and successfully applied to face recognition [1]. The most representative ones are principal component analysis (PCA) and linear discriminant analysis (LDA) [2].
PCA is an unsupervised dimensionality reduction method, which aims to project the high-dimensional data into a low-dimensional subspace spanned by the leading eigenvectors of a covariance matrix. LDA is supervised and its goal is to pursue a low-dimensional subspace by maximizing the ratio of between-class variance to within-class variance. Due to the utilization of label information, LDA usually outperforms PCA for classification tasks when sufficient labeled training data are available. While these two algorithms have attained reasonably good performance in pattern classification, they may fail to discover a highly nonlinear submanifold embedded in the high-dimensional ambient space, as they seek only a compact Euclidean subspace for data representation and classification [3].
Recently, there has been considerable interest in manifold learning algorithms for dimensionality reduction and feature extraction. The basic consideration of these algorithms is that the high-dimensional data may lie on an intrinsic nonlinear low-dimensional manifold. In order to detect the underlying manifold structure, nonlinear dimensionality reduction algorithms such as ISOMAP [4], locally linear embedding (LLE) [5], and Laplacian eigenmap (LE) [6] have been proposed. All of these algorithms are defined only on the training data, and the issue of how to map new testing data remains difficult. Therefore, they cannot be applied to classification problems directly. To overcome this so-called out-of-sample problem, He and Niyogi [7] developed the locality preserving projection (LPP), in which a linear projection function is adopted for mapping new data samples. As LPP is originally unsupervised, some recent attempts have exploited the discriminant information and derived many discriminant manifold learning algorithms to enhance the classification performance. The representative algorithms include local discriminant embedding (LDE) [8], locality sensitive discriminant analysis (LSDA) [9], marginal Fisher analysis (MFA) [10], local Fisher discriminant analysis (LFDA) [11], and discriminative geometry preserving projection (DGPP) [12]. Despite having different assumptions, all these algorithms can be unified into a general graph embedding framework (GEF) [10] with different constraints. While these algorithms have utilized both local geometry and the discriminative information for dimensionality reduction and achieved reasonably good performance in different pattern classification tasks, they assume that the data are represented in a single vector. They can be regarded as single-view-based methods and thus cannot handle data described by multiview features directly.
In many practical pattern classification applications, different views (visual features) have their own specific statistical properties, and each view represents the data partially. To address this problem, the traditional solution for multiple view data is to simply concatenate vectors of different views into a new long vector and then apply dimensionality reduction algorithms directly on the concatenated vector. However, this concatenation ignores the diversity of multiple views and thus cannot explore the complementary nature and specific statistical properties of different views. Recent studies have provided convincing evidence of this fact [13][14][15]. Hence, it is more reasonable to assign different weights to different views (features) for feature extraction and classification. In computer vision and machine learning research, many works have shown that leveraging the complementary nature of the multiple views can better represent the data for feature extraction and classification [13][14][15]. Therefore, an efficient manifold learning algorithm that can cope with multiview data and place proper weights on different views is of great interest and significance.
Motivated by the above observations and reasons, we propose unifying different views under a discriminant manifold learning framework called multiview discriminative geometry preserving projection (MDGPP). Under each view, we can implement the discrimination and local geometry preservations as those used in discriminative geometry preserving projection (DGPP) [12]. Unifying different views in such a multiview discriminant manifold learning framework is meaningful, since data with different features can be appropriately integrated to further improve the classification performance. Specifically, we first implement the discrimination preservation by maximizing the average weighted pairwise distance between samples in different classes and simultaneously minimizing the average weighted pairwise distance between samples in the same class. Meanwhile, the local geometry preservation is implemented by minimizing the reconstruction error of samples in the same class. Then, we learn a low-dimensional feature subspace by utilizing both intraclass geometry and interclass discrimination information, such that the complementary nature of different views (features) can be fully exploited when classification is performed in the derived feature subspace. Experimental results on face recognition and facial expression recognition are presented to demonstrate the effectiveness of the proposed algorithm.
The remainder of the paper is organized as follows. Section 2 reviews the related works. Section 3 presents the details of the proposed MDGPP algorithm. Experimental results on face recognition are presented in Section 4, and the concluding remarks are provided in Section 5.

Related Works
Multiview learning is an important topic in the machine learning and pattern recognition communities. In such a setting, view weight information is introduced to measure the importance of different features in characterizing data, and different weights reflect different contributions to the learning process. The aim of multiview learning is to exploit the complementary information of different views, rather than only a single view, to further improve the learning performance. The traditional solution for multiview data is to concatenate all features into one vector and then conduct machine learning in the resulting feature space. However, this solution is not optimal, as these features usually have different physical properties. Simply concatenating them ignores the complementary nature and specific statistical properties of different views, thus causing performance degradation. In addition, this simple concatenation leads to the curse of dimensionality problem for the subsequent learning task.
In order to perform multiview learning, much effort has been focused on multiview metric learning [14], multiview classification and retrieval [16], multiview clustering [15], and multiview semisupervised learning [17]. All these approaches demonstrated that the learning performance can be significantly enhanced if the complementary nature of different views is exploited and all views are appropriately integrated. It is very natural that the multiview learning idea should also be considered in dimensionality reduction. However, most of the existing dimensionality reduction algorithms are designed only for single view data and cannot cope with multiview data directly. To address this problem, Long et al. [18] first proposed the multiple view spectral embedding (MVSE) method. MVSE performs a dimensionality reduction process on each view independently, and then, based on the obtained low-dimensional representations, it constructs a common low-dimensional embedding that is as close as possible to each representation. Although MVSE allows selecting different dimensionality reduction algorithms for each view, the original multiview data are invisible to the final learning process. Thus, MVSE cannot well explore the complementary information of different views. Xia et al. [19] proposed the multiview spectral embedding (MSE) method to find a low-dimensional and sufficiently smooth embedding based on the patch alignment framework [20]. However, MSE ignores the flexibility of allowing shared information between subsets of different views owing to the global coordinate alignment process. To unify different views for dimensionality reduction under a probabilistic framework, Xie et al. [21] extended the stochastic neighbor embedding (SNE) to its multiview version and proposed multiview stochastic neighbor embedding (MSNE).
Although MSNE operates on a probabilistic framework, it is an unsupervised method and its classification abilities may be limited since the class label information is not used in the learning process. More recently, inspired by the recent advances of the sparse coding technique, Han et al. [22] proposed the spectral sparse multiview embedding (SSMVE) method to deal with dimensionality reduction for multiview data. Although SSMVE can impose a sparsity constraint on the loading matrix of multiview dimensionality reduction, it is unsupervised and does not explicitly consider the manifold structure on which the high-dimensional data possibly reside. In the next section, focusing on manifold learning and pattern classification, we propose a novel multiview discriminative geometry preserving projection (MDGPP) for multiview dimensionality reduction, which explicitly considers the local manifold structure and discriminative information as well as the complementary characteristics of different views in high-dimensional data.

Multiview Discriminative Geometry Preserving Projection (MDGPP)
In this section, we propose a new manifold learning algorithm called multiview discriminative geometry preserving projection (MDGPP), which aims to find a unified low-dimensional and sufficiently smooth embedding over all views simultaneously. To better explain the algorithm details of the proposed MDGPP, we introduce some important notations used in the remainder of this paper. Capital letters such as $X$ denote data matrices, and $X_{ij}$ represents the $(i, j)$ entry of $X$. Lowercase letters such as $x$ denote data vectors, and $x_i$ represents the $i$th data element of $x$. The superscript $(i)$, as in $X^{(i)}$ and $x^{(i)}$, denotes data from the $i$th view. Based on these notations, MDGPP can be described as follows according to the DGPP framework [12].
Given a multiview data set with $n$ data samples, each with $m$ feature representations, that is, $X = \{X^{(i)} \in \mathbb{R}^{p_i \times n}\}_{i=1}^{m}$, wherein $X^{(i)}$ represents the feature matrix for the $i$th view, the aim of MDGPP is to find a projective matrix $U \in \mathbb{R}^{p_i \times d}$, where $d$ denotes the dimension of the low-dimensional feature representation and satisfies $d < p_i$ ($1 \le i \le m$). The workflow of MDGPP can be simply described as follows. First, MDGPP builds a part optimization for a sample on a single view by preserving both the intraclass geometry and interclass discrimination. Afterward, all parts of optimization from different views are unified as a whole via view weight coefficients. Then an alternating-optimization-based iterative algorithm is derived to obtain the optimal low-dimensional embedding from multiple views.
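Given the $i$th view $X^{(i)} = [x^{(i)}_1, x^{(i)}_2, \ldots, x^{(i)}_n] \in \mathbb{R}^{p_i \times n}$, MDGPP first attempts to preserve discriminative information in the reduced low-dimensional space by maximizing the average weighted pairwise distance between samples in different classes and simultaneously minimizing the average weighted pairwise distance between samples in the same class on the $i$th view. Writing $y^{(i)}_j = U^T x^{(i)}_j$ for the projection of $x^{(i)}_j$, we have

$$\arg\max_{U} \sum_{j,k=1}^{n} H^{(i)}_{jk} \left\| y^{(i)}_j - y^{(i)}_k \right\|^2 = \arg\max_{U} \operatorname{Tr}\left(U^T X^{(i)} L^{(i)} (X^{(i)})^T U\right), \tag{1}$$

where $\operatorname{Tr}(\cdot)$ denotes the trace operation of a matrix, $L^{(i)} = D^{(i)} - H^{(i)}$ is the graph Laplacian on the $i$th view, $D^{(i)}$ is a diagonal matrix with elements $D^{(i)}_{jj} = \sum_{k} H^{(i)}_{jk}$ on the $i$th view, and $H^{(i)}$ is the weighting matrix which encodes both the distance weighting information and the class label information on the $i$th view:

$$H^{(i)}_{jk} = \begin{cases} c^{(i)}_{jk}\left(\dfrac{1}{n} - \dfrac{1}{n_l}\right), & \text{if } l_j = l_k = l, \\ \dfrac{1}{n}, & \text{if } l_j \ne l_k, \end{cases} \tag{2}$$

where $l_j$ is the class label of sample $x^{(i)}_j$, $n_l$ is the number of samples belonging to the $l$th class, and $c^{(i)}_{jk}$ is set as $\exp(-\|x^{(i)}_j - x^{(i)}_k\| / \sigma^2)$ according to LPP [7] for locality preservation.

Second, we implement the local geometry preservation by assuming that each sample $x^{(i)}_j$ can be linearly reconstructed by the samples $x^{(i)}_k$ which share the same class label with $x^{(i)}_j$ on the $i$th view. Thus, we can obtain the reconstruction coefficients $w^{(i)}_{jk}$ by minimizing the reconstruction error $\sum_{j=1}^{n} \|\varepsilon_j\|^2$ on the $i$th view; that is,

$$\arg\min_{w^{(i)}} \sum_{j=1}^{n} \Big\| x^{(i)}_j - \sum_{k:\, l_k = l_j} w^{(i)}_{jk} x^{(i)}_k \Big\|^2 \tag{3}$$

subject to

$$\sum_{k:\, l_k = l_j} w^{(i)}_{jk} = 1. \tag{4}$$

Then, by solving (3) and (4), we have

$$w^{(i)}_{jk} = \frac{\sum_{t} \left(C^{-1}\right)_{kt}}{\sum_{p} \sum_{q} \left(C^{-1}\right)_{pq}}, \tag{5}$$

where $C_{kt} = (x^{(i)}_j - x^{(i)}_k)^T (x^{(i)}_j - x^{(i)}_t)$ denotes the local Gram matrix and $l_k = l_t = l_j$.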
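As a rough illustration of the two constructions described above (the class-aware weighting matrix with its graph Laplacian, and the same-class reconstruction coefficients), the following NumPy sketch may help. The function names, the column-per-sample layout, the kernel width `sigma`, and the small ridge term `reg` are assumptions made for this example and are not taken from the paper. The same-class weight is assumed to be the distance weight scaled by (1/n - 1/n_l), and the different-class weight is assumed to be 1/n.

```python
import numpy as np


def discriminative_laplacian(X, labels, sigma=1.0):
    """Return (H, L) for one view, with L = D - H.

    X      : (p, n) array, one column per sample (assumed layout).
    labels : (n,) class labels.
    sigma  : kernel width; 1.0 is an arbitrary placeholder value.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[1]
    # Pairwise Euclidean distances between the columns of X.
    diff = X[:, :, None] - X[:, None, :]
    dist = np.sqrt((diff ** 2).sum(axis=0))
    c = np.exp(-dist / sigma ** 2)  # distance weighting
    same = labels[:, None] == labels[None, :]
    # Size of the class that each sample belongs to.
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    n_l = counts[inverse].astype(float)
    # Same-class pairs get a non-positive weight, other pairs get 1/n.
    H = np.where(same, c * (1.0 / n - 1.0 / n_l[:, None]), 1.0 / n)
    D = np.diag(H.sum(axis=1))
    return H, D - H


def reconstruction_weights(X, labels, reg=1e-3):
    """Return W whose row j reconstructs sample j from same-class samples.

    Each row sums to one. The ridge term `reg` is an assumption added
    here to keep the local Gram matrix invertible.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[1]
    W = np.zeros((n, n))
    for j in range(n):
        nbrs = np.flatnonzero((labels == labels[j]) & (np.arange(n) != j))
        if nbrs.size == 0:
            continue  # a class with a single sample has nothing to use
        Z = X[:, [j]] - X[:, nbrs]              # columns x_j - x_k
        C = Z.T @ Z + reg * np.eye(nbrs.size)   # regularized local Gram matrix
        w = np.linalg.solve(C, np.ones(nbrs.size))
        W[j, nbrs] = w / w.sum()                # enforce the sum-to-one constraint
    return W
```

Solving the linear system and normalizing is the usual way to evaluate the closed-form reconstruction coefficients without forming the inverse Gram matrix explicitly.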
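Once the reconstruction coefficients $w^{(i)}_{jk}$ on the $i$th view are obtained, MDGPP aims to reconstruct $y^{(i)}_j$ ($= U^T x^{(i)}_j$) from the samples of the same class with the same coefficients in the low-dimensional space:

$$\arg\min_{U} \sum_{j=1}^{n} \Big\| y^{(i)}_j - \sum_{k:\, l_k = l_j} w^{(i)}_{jk} y^{(i)}_k \Big\|^2 = \arg\min_{U} \operatorname{Tr}\left(U^T X^{(i)} (I^{(i)} - W^{(i)})^T (I^{(i)} - W^{(i)}) (X^{(i)})^T U\right), \tag{6}$$

where $I^{(i)}$ is an identity matrix defined on the $i$th view, and $W^{(i)}$ is the reconstruction coefficient matrix on the $i$th view, whose $(j, k)$ entry is $w^{(i)}_{jk}$. As a result, by combining (1) and (6) together, the part optimization for $X^{(i)}$ is

$$\arg\max_{U} \operatorname{Tr}\left(U^T Q^{(i)} U\right), \tag{7}$$

where $Q^{(i)} = X^{(i)} \left(L^{(i)} - \beta (I^{(i)} - W^{(i)})^T (I^{(i)} - W^{(i)})\right) (X^{(i)})^T$, and $\beta$ is a tradeoff coefficient which is empirically set as 1 in this experiment.
Based on the local manifold information encoded in $L^{(i)}$ and $W^{(i)}$, (7) aims at finding a sufficiently smooth low-dimensional embedding $Y^{(i)}$ ($= U^T X^{(i)}$) by preserving the interclass discrimination and intraclass geometry on the $i$th view.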
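The per-view matrix that the part optimization in (7) operates on can be sketched as below. This is a minimal illustration under stated assumptions: the name `part_matrix`, the symbol names, and the exact form (the data matrix times the Laplacian minus the tradeoff coefficient times the reconstruction term, times the transposed data matrix) are reconstructed from the description above rather than copied from the paper.

```python
import numpy as np


def part_matrix(X, L, W, beta=1.0):
    """Assemble the symmetric matrix of the single-view trace objective.

    X    : (p, n) data matrix of one view, one column per sample.
    L    : (n, n) symmetric graph Laplacian of the discriminative weights.
    W    : (n, n) reconstruction coefficients, row j reconstructs sample j.
    beta : tradeoff between discrimination and geometry preservation.
    """
    n = X.shape[1]
    R = np.eye(n) - W
    # Discrimination term minus beta times the reconstruction-error term.
    return X @ (L - beta * (R.T @ R)) @ X.T
```

For any projection `U`, `np.trace(U.T @ part_matrix(X, L, W, beta) @ U)` then equals the weighted pairwise-distance term minus `beta` times the reconstruction error of the projected samples, which is the quantity the single-view step tries to make large.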
Because multiple views can provide complementary information in characterizing data from different viewpoints, different views certainly have different contributions to the low-dimensional feature subspace. In order to discover the complementary information of data from different views, a set of nonnegative weights $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_m]$ is imposed on the views independently. Generally speaking, the larger $\alpha_i$ is, the more the view $X^{(i)}$ contributes to the low-dimensional feature subspace. Hence, by summing over all parts of optimization defined in (7), and writing $Q^{(i)}$ for the matrix of the part optimization on the $i$th view, we can formulate MDGPP as the following optimization problem:

$$\arg\max_{U, \alpha} \sum_{i=1}^{m} \alpha_i \operatorname{Tr}\left(U^T Q^{(i)} U\right) \tag{8}$$

subject to

$$U^T U = I, \quad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge 0. \tag{9}$$

The solution to $\alpha$ in (8) subject to (9) is $\alpha_i = 1$ corresponding to the maximum $\operatorname{Tr}(U^T Q^{(i)} U)$ over different views, and $\alpha_i = 0$ otherwise, which means that only the best view is finally selected by this method. Consequently, this solution cannot meet the demand for exploring the complementary characteristics of different views to get a better low-dimensional embedding than that based on a single view. In order to avoid this problem, we set $\alpha_i \leftarrow \alpha_i^r$ with $r > 1$ by following the trick utilized in [16][17][18][19]. In this condition, $\sum_{i=1}^{m} \alpha_i^r$ achieves its minimum when $\alpha_i = 1/m$, subject to $\sum_{i=1}^{m} \alpha_i = 1$ and $\alpha_i \ge 0$. Similarly, $\alpha_i$ for different views can be obtained by setting $r > 1$; thus each view makes a specific contribution to obtaining the final low-dimensional embedding. Consequently, the new objective function of MDGPP can be defined as follows:

$$\arg\max_{U, \alpha} \sum_{i=1}^{m} \alpha_i^r \operatorname{Tr}\left(U^T Q^{(i)} U\right) \tag{10}$$

subject to

$$U^T U = I, \quad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge 0. \tag{11}$$

The above optimization problem is a nonlinearly constrained nonconvex optimization problem, so there is no direct approach to find its global optimal solution. In this paper, we derive an alternating-optimization-based iterative algorithm to find a local optimal solution. The alternating optimization iteratively updates the projection matrix $U$ and the weight vector $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_m]$.
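First, we update $\alpha$ by fixing $U$. With $Q^{(i)}$ denoting the matrix defined in (7), the optimization problem (10) subject to (11) becomes

$$\arg\max_{\alpha} \sum_{i=1}^{m} \alpha_i^r \operatorname{Tr}\left(U^T Q^{(i)} U\right) \tag{12}$$

subject to

$$\sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge 0. \tag{13}$$

Following the standard Lagrange multiplier method, we construct the following Lagrangian function by incorporating the constraint (13) into (12):

$$\mathcal{L}(\alpha, \lambda) = \sum_{i=1}^{m} \alpha_i^r \operatorname{Tr}\left(U^T Q^{(i)} U\right) - \lambda \left(\sum_{i=1}^{m} \alpha_i - 1\right), \tag{14}$$

where the Lagrange multiplier $\lambda$ satisfies $\lambda \ge 0$.
Taking the partial derivatives of the Lagrangian function $\mathcal{L}(\alpha, \lambda)$ with respect to $\alpha_i$ and $\lambda$ and setting them to zero, we have

$$\frac{\partial \mathcal{L}(\alpha, \lambda)}{\partial \alpha_i} = r \alpha_i^{r-1} \operatorname{Tr}\left(U^T Q^{(i)} U\right) - \lambda = 0, \quad i = 1, 2, \ldots, m, \tag{15}$$

$$\frac{\partial \mathcal{L}(\alpha, \lambda)}{\partial \lambda} = 1 - \sum_{i=1}^{m} \alpha_i = 0. \tag{16}$$

Hence, according to (15) and (16), the weight coefficient $\alpha_i$ can be calculated as

$$\alpha_i = \frac{\left(1 / \operatorname{Tr}\left(U^T Q^{(i)} U\right)\right)^{1/(r-1)}}{\sum_{j=1}^{m} \left(1 / \operatorname{Tr}\left(U^T Q^{(j)} U\right)\right)^{1/(r-1)}}. \tag{17}$$

Then, we can make the following observations according to (17). If $r \to \infty$, the values of different $\alpha_i$ will be close to each other. If $r \to 1$, the weight concentrates on a single view, with $\alpha_i = 1$ for that view and $\alpha_i = 0$ otherwise. Thus, the choice of $r$ should respect the complementary property of different views. The effect of the parameter $r$ will be discussed in the later experiments.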
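The weight update described above can be illustrated with a short sketch. It computes the stationary point of the Lagrangian, that is, the weights for which the derivative with respect to every weight equals the same multiplier while the weights sum to one. The function name is hypothetical, and the sketch assumes that every per-view trace is strictly positive, which the text does not guarantee.

```python
import numpy as np


def update_view_weights(traces, r):
    """Stationary-point weights for fixed per-view trace values.

    traces : length-m sequence of per-view trace values (assumed > 0).
    r      : tuning exponent, must be greater than 1.
    """
    t = np.asarray(traces, dtype=float)
    if r <= 1:
        raise ValueError("r must be greater than 1")
    if np.any(t <= 0):
        raise ValueError("this sketch assumes strictly positive traces")
    w = (1.0 / t) ** (1.0 / (r - 1.0))
    return w / w.sum()  # normalize so the weights sum to one


alpha = update_view_weights([1.0, 2.0, 4.0], r=2.0)
# Stationarity check: r * alpha**(r - 1) * trace is the same for every view.
print(2.0 * alpha * np.array([1.0, 2.0, 4.0]))
```

As the exponent grows, the power `1 / (r - 1)` approaches zero and the weights approach the uniform value `1 / m`, matching the first observation above.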
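Second, we update $U$ by fixing $\alpha$. The optimization problem (10) subject to (11) can be equivalently transformed into the following form:

$$\arg\max_{U} \operatorname{Tr}\left(U^T M U\right) \tag{18}$$

subject to

$$U^T U = I, \tag{19}$$

where $M = \sum_{i=1}^{m} \alpha_i^r Q^{(i)}$. Since $Q^{(i)}$ defined in (7) is a symmetric matrix, $M$ is also a symmetric matrix.
The Ky Fan theorem states that, for a symmetric matrix $M$, the maximum of $\operatorname{Tr}(U^T M U)$ over all $U$ with $d$ orthonormal columns equals the sum of the $d$ largest eigenvalues of $M$ and is attained when the columns of $U$ are the corresponding eigenvectors. From this theorem, we can make the following observations. The optimal solution to (18) subject to (19) is composed of the eigenvectors associated with the $d$ largest eigenvalues of the matrix $M$, obtained from the eigendecomposition

$$M u = \lambda u, \tag{20}$$

and the optimal value of objective function (18) equals the sum of the $d$ largest eigenvalues of $M$. Therefore, the optimal reduced feature dimension is equivalent to the number of positive eigenvalues of the matrix $M$.
Alternately updating $\alpha$ and $U$ by solving (17) and (20) until convergence, we can obtain the final optimal projection matrix $U$ for multiple views. A simple initialization for $\alpha$ could be $\alpha = [1/m, \ldots, 1/m]$. According to the aforementioned statement, the proposed MDGPP algorithm is summarized as follows.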
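The eigenvector selection described above can be sketched with `numpy.linalg.eigh`, which returns the eigenvalues of a symmetric matrix in ascending order. The function name and the explicit symmetrization step are assumptions made for this illustration.

```python
import numpy as np


def top_eigenvectors(M, d):
    """Return the d leading eigenvectors and eigenvalues of a symmetric matrix.

    The columns of the returned U are orthonormal, and Tr(U.T @ M @ U)
    equals the sum of the d largest eigenvalues of M.
    """
    M = 0.5 * (M + M.T)                  # guard against round-off asymmetry
    vals, vecs = np.linalg.eigh(M)       # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:d]   # indices of the d largest eigenvalues
    return vecs[:, order], vals[order]


def count_positive_eigenvalues(M, tol=1e-10):
    """Number of eigenvalues above tol, one reading of the 'optimal' dimension."""
    return int(np.sum(np.linalg.eigvalsh(0.5 * (M + M.T)) > tol))
```

Any other matrix with `d` orthonormal columns gives a trace that is no larger than the one obtained from `top_eigenvectors`, which is what makes this choice optimal for the fixed-weight subproblem.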
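Step 1. Simultaneously consider both intraclass geometry and interclass discrimination information to calculate $Q^{(i)}$ for each view according to (7).
Step 2 (initialization). Set $\alpha = [1/m, \ldots, 1/m]$.
Step 3 (alternating optimization). Until convergence, compute $M = \sum_{i=1}^{m} \alpha_i^r Q^{(i)}$, update $U$ by solving (20), and update $\alpha$ by (17).
Step 4 (output projection matrix). Output the final optimal projection matrix $U$.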
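The alternating procedure summarized in these steps can be sketched as follows, starting from precomputed per-view matrices. This is an illustrative sketch, not the authors' implementation. It assumes that all per-view matrices have the same size so that they can be summed, that the weight update is the stationary point of the Lagrangian, and that non-positive traces are clamped to a small constant `eps`; the function name and the fixed iteration count are also assumptions.

```python
import numpy as np


def mdgpp_alternate(Qs, d, r=5.0, n_iter=5, eps=1e-12):
    """Alternate between the projection update and the view-weight update.

    Qs     : list of m symmetric (p, p) matrices, one per view (assumed
             to share the same size p).
    d      : target dimension.
    r      : tuning exponent, greater than 1.
    n_iter : number of alternating iterations.
    """
    m = len(Qs)
    alpha = np.full(m, 1.0 / m)  # uniform initialization
    U = None
    for _ in range(n_iter):
        # Fix alpha, update U: leading eigenvectors of the weighted sum.
        M = sum(a ** r * Q for a, Q in zip(alpha, Qs))
        M = 0.5 * (M + M.T)
        vals, vecs = np.linalg.eigh(M)
        U = vecs[:, np.argsort(vals)[::-1][:d]]
        # Fix U, update alpha from the per-view traces.
        t = np.array([np.trace(U.T @ Q @ U) for Q in Qs])
        t = np.maximum(t, eps)  # safeguard added for this sketch only
        w = (1.0 / t) ** (1.0 / (r - 1.0))
        alpha = w / w.sum()
    return U, alpha
```

A new sample would then be projected with `U.T @ x` before classification. In practice a convergence test on the objective value could replace the fixed iteration count.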
We now briefly analyze the computational complexity of the MDGPP algorithm, which is dominated by three parts. One is constructing the matrix $Q^{(i)}$ for the different views. As shown in (7), the computational complexity of this part is $O((\sum_{i=1}^{m} p_i) \times n^2)$. In addition, each iteration involves computing the view weights $\alpha$ and solving a standard eigendecomposition problem; the computational complexities of these two parts in each iteration are $O((m + d) \times n^2)$ and $O(n^3)$, respectively. Therefore, the total computational complexity of MDGPP is $O((\sum_{i=1}^{m} p_i) \times n^2 + ((m + d) \times n^2 + n^3) \times T_{\max})$, where $T_{\max}$ denotes the iteration number and is always set to less than five in all experiments.

Experimental Results
In this section, we evaluate the effectiveness of our proposed MDGPP algorithm for two image classification tasks, namely face recognition and facial expression recognition. Two widely used face databases, AR [24] and CMU PIE [25], are employed for face recognition evaluation, and the well-known Japanese female facial expression (JAFFE) [26] database is used for facial expression recognition evaluation. We also compare the proposed MDGPP algorithm with some traditional single-view-based dimensionality reduction algorithms, such as PCA [2], LDA [2], LPP [3], MFA [10], and DGPP [12], and with four recent multiview dimensionality reduction algorithms, namely MVSE [18], MSNE [21], MSE [19], and SSMVE [22]. The nearest neighbor classifier with the Euclidean distance was adopted for classification. For a fair comparison, all the results reported here are based on the best tuned parameters of all the compared algorithms.

Data Sets and Experimental Settings.
We conducted face recognition experiments on the widely used AR and CMU PIE face databases and facial expression recognition experiments on the well-known Japanese female facial expression (JAFFE) database.
The AR database [24] contains over 4,000 color images corresponding to 126 people (70 men and 56 women), which include frontal view faces with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf). Each person has 26 different images taken in two sessions (separated by two weeks). In our experiments, we used a subset of 800 face images from 100 persons (50 men and 50 women) with eight face images of different expressions and lighting conditions per person. Figure 1 shows eight sample images of one individual from the subset of the AR database.
The CMU PIE database [25] comprises more than 40,000 facial images of 68 people with different poses, illumination conditions, and facial expressions. In this experiment, we selected a subset of the CMU PIE database which consists of 3060 frontal face images with varying expression and illumination from 68 persons, with 45 images from each person. Figure 2 shows some sample images of one individual from the subset of the CMU PIE database.
The Japanese female facial expression (JAFFE) database [26] contains 213 facial images of ten Japanese women. Each facial image shows one of seven expressions: neutral, happiness, sadness, surprise, anger, disgust, or fear. Figure 3 shows some facial images from the JAFFE database. In this experiment, following the general setting scheme of facial expression recognition, we discard all the neutral facial images and only utilize the remaining 183 facial images, which include six basic facial expressions.
For all the face images in the above three face databases, the facial part of each image was manually aligned, cropped, and resized to 32 × 32 according to the eye positions. For each facial image, we extract four commonly used kinds of low-level visual features to represent four different views. These four features are color histogram (CH) [27], scale-invariant feature transform (SIFT) [28], Gabor [29], and local binary pattern (LBP) [30]. For the CH feature extraction, we used 64 bins to encode a histogram feature for each facial image according to [27]. For the SIFT feature extraction, we densely sampled and calculated the SIFT descriptors of 16 × 16 patches over a grid with spacing of 8 pixels according to [28]. For the Gabor feature extraction, following [29], we adopted 40 Gabor kernel functions from five scales and eight orientations. For the LBP feature extraction, we followed the parameter settings in [30] and utilized 256 bins to encode a histogram feature for each facial image. For more details on these four feature descriptors, please refer to [27][28][29][30]. Because these four features are complementary to each other in representing facial images, we empirically set the tuning parameter $r$ in MDGPP to be five.
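To make the 256-bin LBP view more concrete, the sketch below computes a basic 8-neighbor local binary pattern histogram for a grayscale image. It is only an illustration: the exact LBP variant and parameter settings of [30] are not given here, so the neighbor ordering, the treatment of border pixels (skipped), and the normalization are assumptions of this example.

```python
import numpy as np


def lbp_histogram(image):
    """Normalized 256-bin histogram of basic 3x3 LBP codes.

    image : 2-D grayscale array, for example a 32 x 32 face crop.
    Border pixels are skipped because they lack a full neighborhood.
    """
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    center = img[1:h - 1, 1:w - 1]
    codes = np.zeros(center.shape, dtype=np.uint8)
    # Eight neighbors visited clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbor is at least as bright as the center.
        codes += (neighbor >= center).astype(np.uint8) * np.uint8(1 << bit)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

Each image then contributes one 256-dimensional column to the feature matrix of the LBP view.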
In this experiment, each facial image set was partitioned into nonoverlapping training and testing sets. For each database, we randomly selected 50% of the data as the training set and used the remaining 50% as the testing set. To reduce the statistical variation of each random partition, we repeated these trials independently ten times and reported the average recognition results.
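The evaluation protocol described above (random 50/50 split, nearest neighbor classification with the Euclidean distance, average over ten trials) can be sketched as follows. The function names are hypothetical, the features are assumed to be already projected into the low-dimensional space with one row per sample, and the split is assumed to be drawn over the whole data set rather than per class, which the text does not specify.

```python
import numpy as np


def nearest_neighbor_predict(train_X, train_y, test_X):
    """Label each test row with the label of its closest training row."""
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    return train_y[np.argmin(d2, axis=1)]


def repeated_holdout_accuracy(features, labels, n_trials=10,
                              train_fraction=0.5, seed=0):
    """Average 1-NN accuracy over repeated random train/test partitions."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    n = len(labels)
    n_train = int(round(train_fraction * n))
    accuracies = []
    for _ in range(n_trials):
        perm = rng.permutation(n)
        train_idx, test_idx = perm[:n_train], perm[n_train:]
        pred = nearest_neighbor_predict(features[train_idx], labels[train_idx],
                                        features[test_idx])
        accuracies.append(np.mean(pred == labels[test_idx]))
    return float(np.mean(accuracies))
```

In a full pipeline the projection would be learned on the training half of each partition only, and then applied to both halves before calling the classifier.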
The Scientific World Journal

Compared Algorithms.
We compared our proposed MDGPP algorithm with the following dimensionality reduction algorithms.
(2) LDA [2]: LDA is a supervised dimensionality reduction algorithm. We adopted a Tikhonov regularization term rather than PCA preprocessing to avoid the well-known small sample size (singularity) problem in LDA.
(3) LPP [3]: LPP is an unsupervised manifold learning algorithm. There is a nearest neighbor number to be tuned in LPP and it was empirically set to be five in our experiments. In addition, the Tikhonov regularization was also adopted to avoid the small sample size (singularity) problem in LPP.
(4) MFA [10]: MFA is a supervised manifold learning algorithm. There are two parameters (i.e., the $k_1$ nearest neighbor number and the $k_2$ nearest neighbor number) to be tuned in MFA. We empirically set $k_1 = 5$ and $k_2 = 20$ in our experiments. Meanwhile, the Tikhonov regularization was also adopted to avoid the small sample size (singularity) problem in MFA.
(5) DGPP [12]: DGPP is a supervised manifold learning algorithm. There is a tradeoff parameter $\beta$ to be tuned in DGPP and it was empirically set to be one in our experiments.
(7) MSNE [21]: MSNE is a probability-based unsupervised multiview algorithm. We followed the parameter setting in [21] and set the tradeoff coefficient to be five in our experiments.
(8) MSE [19]: MSE is a supervised multiview algorithm. There are two parameters (i.e., the nearest neighbor number $k$ and the tuning coefficient $r$) to be tuned in MSE. We empirically set $k = 5$ and $r = 5$ in our experiments.
(9) SSMVE [22]: SSMVE is a sparse unsupervised multiview algorithm. We followed the parameter setting method in [22] and set the regularized parameter to be one in our experiments.
It is worth noting that since PCA, LDA, LPP, MFA, and DGPP are all single-view-based algorithms, these five algorithms adopt the conventional feature concatenation-based strategy to cope with the multiview data. According to the experimental results, we can make the following observations.
(1) As can be seen from Tables 1, 2, and 3, the proposed MDGPP algorithm, which learns a low-dimensional feature subspace by using both intraclass geometry and interclass discrimination and explicitly considering the complementary information of different facial features, can achieve the best recognition performance.
(2) The multiview learning algorithms (i.e., MVSE, MSNE, MSE, SSMVE, and MDGPP) perform much better than single-view-based algorithms (i.e., PCA, LDA, LPP, MFA, and DGPP), which demonstrates that the simple concatenation strategy cannot duly combine features from multiple views, and the recognition performance can be successfully improved by exploring the complementary characteristics of different views.
(3) For the single-view-based algorithms, the manifold learning algorithms (i.e., LPP, MFA, and DGPP) perform much better than the conventional dimensionality reduction algorithms (i.e., PCA and LDA). This observation confirms that the local manifold structure information is crucial for image classification. Moreover, the supervised manifold learning algorithms (i.e., MFA and DGPP) perform much better than the unsupervised manifold learning algorithm LPP, which demonstrates that the utilization of discriminant information is useful to improve the image classification performance.
(4) For the multiview learning algorithms, the supervised multiview algorithms (i.e., MSE and MDGPP) outperform the unsupervised multiview algorithms (i.e., MVSE, MSNE, and SSMVE) due to the utilization of the labeled facial images.
(5) Although MVSE, MSNE, and SSMVE are all unsupervised multiview learning algorithms, SSMVE performs much better than MVSE and MSNE. The possible explanation is that the SSMVE algorithm adopts the sparse coding technique, which is naturally discriminative in determining the appropriate combination of different views.
(6) Among the compared multiview learning algorithms, MVSE performs the worst. The reason is that MVSE performs a dimensionality reduction process on each view independently. Hence it cannot fully integrate the complementary information of different views to produce a good low-dimensional embedding.
(7) MDGPP can improve the recognition performance of DGPP. The reason is that MDGPP can make use of multiple facial feature representations in a common learned subspace such that some complementary information can be explored for recognition task.

Convergence Analysis.
Since our proposed MDGPP is an iterative algorithm, we also evaluate its recognition performance with different numbers of iterations. Figures 7, 8, and 9 show the recognition accuracy of MDGPP versus the number of iterations on the AR, CMU PIE, and JAFFE databases, respectively. As can be seen from these figures, our proposed MDGPP algorithm converges to a local optimum in fewer than five iterations.

Parameter Analysis.
We investigate the parameter effects of our proposed MDGPP algorithm: the tradeoff coefficient $\beta$ and the tuning parameter $r$. Since each parameter can affect the recognition performance, we fix one parameter as used in the previous experiments and test the effect of the remaining one. Figures 10, 11, and 12 show the influence of the parameter $\beta$ in the MDGPP algorithm on the AR, CMU PIE, and JAFFE databases, respectively. Figures 13, 14, and 15 show the influence of the parameter $r$ in the MDGPP algorithm on the AR, CMU PIE, and JAFFE databases, respectively. From Figure 10 to Figure 15, we can observe that MDGPP demonstrates a stable recognition performance over a large range of both $\beta$ and $r$. Therefore, we can conclude that the performance of MDGPP is not sensitive to the parameters $\beta$ and $r$.

Conclusion
In this paper, we have proposed a new multiview learning algorithm, called multiview discriminative geometry preserving projection (MDGPP), for feature extraction and classification by exploring the complementary property of different views. MDGPP can encode different features from different views in a physically meaningful subspace and learn a low-dimensional and sufficiently smooth embedding over all views simultaneously with an alternating-optimization-based iterative algorithm. Experimental results on three face image databases show that the proposed MDGPP algorithm