Robust Face Recognition Based on a New Supervised Kernel Subspace Learning Method

Face recognition is one of the most popular techniques for determining the identity of a person. This study develops a new nonlinear subspace learning method named "supervised kernel locality-based discriminant neighborhood embedding" (SKLDNE), which performs data classification by learning an optimal embedded subspace from the original high-dimensional space. In this approach, not only is the nonlinear and complex variation of face images effectively represented using nonlinear kernel mapping, but the local structure information of data from the same class and the discriminant information from distinct classes are also simultaneously preserved to further improve the final classification performance. Moreover, in order to evaluate the robustness of the proposed method, it was compared with several well-known pattern recognition methods through comprehensive experiments on six publicly accessible datasets. Experimental results reveal that our method consistently outperforms its competitors, which demonstrates strong potential for implementation in many real-world systems.


Introduction
In the field of face recognition, many different recognition approaches based on dimensionality reduction have been developed in recent years [1][2][3]. Dimensionality reduction is a central problem in numerous recognition techniques, caused by the great amount of high-dimensional data in many real-world applications [4][5][6]. In fact, dimensionality reduction techniques have been recommended by researchers to avoid "the curse of dimensionality" and to improve the computational efficiency of image recognition [7,8]. Generally, dimensionality reduction techniques can be classified into two main groups: linear and nonlinear. Linear methods aim to discover a significant low-dimensional subspace within the high-dimensional input space, where the embedded data have a linear structure [9][10][11][12]. Principal component analysis (PCA) is one of the most famous linear methods [13][14][15]. PCA aims to retain global geometric information for data representation by maximizing the trace of the feature covariance matrix [13,16,17]. Linear discriminant analysis (LDA) is a linear technique that seeks the discriminant information for data classification by maximizing the ratio between the inter-class and intra-class scatters [16,18]. Among the limitations of PCA and LDA are that they can suffer from the small sample size (SSS) problem [15], and that they may fail to capture many important data structures that are nonlinear [19,20].
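To make the linear baseline concrete, the following short sketch implements PCA as described above: center the data, form the feature covariance matrix, and keep the eigenvectors of the largest eigenvalues. This is an illustrative NumPy implementation, not code from this study; the sample matrix and the target dimension r are assumptions.

```python
import numpy as np

def pca(X, r):
    """Project n samples (rows of X) onto the top-r principal components,
    i.e., the directions that maximize the trace of the projected
    feature covariance matrix."""
    Xc = X - X.mean(axis=0)          # center the data
    C = Xc.T @ Xc / (len(X) - 1)     # feature covariance matrix
    vals, vecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    P = vecs[:, ::-1][:, :r]         # eigenvectors of the r largest eigenvalues
    return Xc @ P                    # low-dimensional representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 samples in R^5
Z = pca(X, 2)                        # embed into R^2
```

Because PCA is unsupervised, the projection uses no class labels, which is precisely the limitation that the supervised methods discussed below address.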
Scholars have developed abundant practical nonlinear dimensionality reduction strategies [21] to address these problems. They can be classified into two types: manifold-learning-based and kernel-based techniques [22,23]. Manifold learning directly aims to discover the principal nonlinear low-dimensional structures concealed in the input space; isometric feature mapping (Isomap) is one well-known example. The main contributions of the proposed SKLDNE method can be summarized as follows:
(1) SKLDNE has been successfully designed to retain the local geometric relations of within-class samples, which are very important for image recognition. Generally, the classification strength of methods with a linear learning algorithm is restricted; they fail to deal with complicated problems, and many effective nonlinear data features may be lost during classification with linear techniques such as LDNE, LDA, DNE, and LPP. Therefore, applying a nonlinear method can effectively improve the classification performance.
(2) This technique is a supervised learning method, in which the class labels act as a guide instructing the algorithm as to which conclusion should be found. SKLDNE considers the class label information of neighbors, which is directly connected with classification, in order to enhance the final recognition performance.
(3) It benefits from the advantage of "locality" in LPP, in which geometric relations are preserved with the help of prior class-label information.
(4) Not only can it build a compact submanifold by minimizing the distances between points of the same class, but it also expands the gaps among the submanifolds of distinct classes simultaneously, which is called "discrimination."
(5) SKLDNE can resolve the SSS problem mostly faced by the aforementioned techniques such as PCA, LDA, UDP, and LPP, as well as the "overlearning of locality" problem in manifold learning.
(6) Owing to its kernel weighting, it is very efficient in reducing the negative influence of outliers on the projection directions, which effectively handles the drawbacks of linear models and makes it more robust to outliers.
The rest of this study is organized as follows: Section 2 reviews LPP, DNE, and LDNE. Section 3 describes the proposed SKLDNE method and the pertinent algorithm and mathematics. Section 4 focuses on the experiments and analyses carried out. Section 5 elucidates the conclusions and future research.

Outline of LPP, DNE, and LDNE
In this section we will briefly review LPP, DNE, and LDNE, as our proposed method is designed to possess the best characteristics of these techniques in graph embedding.

Locality Preserving Projection
LPP is one of the successful linear algorithms used for dimensionality reduction that finds a graph embedding of data sets. The manifold structure is modeled directly by creating the nearest-neighbor graph, which discloses the vicinity relations of data points, in order to preserve the local structure of the input data in the projection [48]. LPP is a linear approximation of the Laplacian eigenmap, which searches for a transformation P by which the high-dimensional input data X = [x_1, x_2, . . . , x_n] are projected into a low-dimensional subspace Z while the local structure is retained [49]. To calculate the linear transformation P, the following objective function should be minimized [43]:

min_P ∑_{i,j} ||z_i − z_j||² H(i, j),

where the weight matrix H (called the heat kernel) is obtained from the nearest-neighbor graph and z_i = P^T x_i. If x_i is among the l nearest neighbors of x_j, or x_j is among the l nearest neighbors of x_i, then

H(i, j) = exp(−||x_i − x_j||²/t),

where parameter t is an appropriate constant; otherwise, H(i, j) = 0. Alternatively, when x_i and x_j are nearest neighbors, the weight can simply be set as H(i, j) = 1, and H(i, j) = 0 otherwise. The optimal transformation matrix can be calculated by converting the minimization problem into the generalized eigenvalue problem

XLX^T p = λXDX^T p,

where L = D − H is the Laplacian matrix and D, with D_ii = ∑_j H(i, j), is a diagonal matrix.
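The LPP procedure just described can be sketched in a few lines of Python. This is an illustrative implementation under the definitions above (heat-kernel weights over an l-nearest-neighbor graph, then the generalized eigenvalue problem for the smallest eigenvalues), not the authors' code; the neighborhood size l, heat-kernel parameter t, and target dimension r are assumed parameters, and a small ridge term is added for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, l=5, t=1.0, r=2):
    """LPP sketch. X holds one sample per column (d x n), matching
    X = [x1, ..., xn] in the text. Returns the projection matrix P (d x r)."""
    d, n = X.shape
    # pairwise squared distances between samples
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    H = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(sq[i])[1:l + 1]      # l nearest neighbors of x_i
        H[i, idx] = np.exp(-sq[i, idx] / t)   # heat-kernel weight
    H = np.maximum(H, H.T)                    # edge if neighbor in either direction
    D = np.diag(H.sum(axis=1))
    L = D - H                                 # graph Laplacian
    A = X @ L @ X.T
    B = X @ D @ X.T + 1e-9 * np.eye(d)        # regularize for stability
    vals, vecs = eigh(A, B)                   # generalized eigenproblem, ascending
    return vecs[:, :r]                        # eigenvectors of the smallest eigenvalues

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 30))                  # 30 samples in R^4
P = lpp(X, l=3, t=2.0, r=2)
```

Projected coordinates are then obtained as z_i = P^T x_i, exactly as in the objective above.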

Discriminant Neighborhood Embedding
Discriminant neighborhood embedding (DNE) is suggested based on an intuition from dynamics theory. DNE is a supervised learning method in which multi-class data points are pushed or pulled in the high-dimensional space to produce a favorable low-dimensional embedding for classification [15]. Furthermore, DNE effectively avoids the complication of a singular matrix, as there is no need to compute a matrix inverse. Based on these characteristics, DNE offers a good solution to the small sample size (SSS) and out-of-sample problems [50]. Although this technique is effective in pattern classification, it still cannot preserve the geometrical structure information of the data. The main steps of the DNE algorithm are as follows [39]: (1) The adjacency matrix H of graph G, which reflects the underlying supervised manifold structure, is defined as

H(i, j) = +1 if (x_j ∈ knn(i) or x_i ∈ knn(j)) and c_i = c_j; H(i, j) = −1 if (x_j ∈ knn(i) or x_i ∈ knn(j)) and c_i ≠ c_j; H(i, j) = 0 otherwise,

where c_i is the class label of x_i and knn(i) is the set of k nearest neighbors of x_i. Note that each edge is weighted +1 or −1, respectively, to encode the local intra-class attraction and inter-class repulsion between neighboring points. (2) The optimal transformation matrix P is obtained from the minimization problem

argmin_P tr(P^T XLX^T P) subject to P^T P = I,

where L = D − H and D_ii = ∑_j H(i, j) is a diagonal matrix. As in LPP, the projection matrix P can be optimized by calculating the minimum eigenvalue solutions of the eigenvalue problem

XLX^T p = λp,

where P is constituted by the r eigenvectors corresponding to the smallest negative eigenvalues, i.e., λ_1 ≤ λ_2 ≤ . . . ≤ λ_r < 0 ≤ λ_{r+1}.
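The signed adjacency matrix at the heart of DNE can be built directly from the definition above. The sketch below is an illustrative NumPy implementation, not the authors' code; it weights edges between k-nearest neighbors +1 for same-class pairs (intra-class attraction) and −1 for different-class pairs (inter-class repulsion), with columns of X as samples.

```python
import numpy as np

def dne_adjacency(X, labels, k=3):
    """Build the DNE adjacency matrix H: +1 for intra-class neighbor
    pairs, -1 for inter-class neighbor pairs, 0 otherwise."""
    n = X.shape[1]
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    H = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(sq[i])[1:k + 1]:          # j in knn(i)
            H[i, j] = 1.0 if labels[i] == labels[j] else -1.0
    # an edge exists if i is a neighbor of j or vice versa
    H = np.where(np.abs(H) >= np.abs(H.T), H, H.T)
    return H

X = np.array([[0.0, 0.1, 2.0, 2.1],
              [0.0, 0.0, 0.0, 0.1]])                  # 4 samples in R^2
labels = np.array([0, 0, 1, 1])
H = dne_adjacency(X, labels, k=1)
```

With k = 1, the two class-0 points and the two class-1 points are mutual neighbors, so their edges receive weight +1, while non-neighboring pairs remain 0.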

Locality-Based Discriminant Neighborhood Embedding
In the multi-class classification task, N data points should be classified, and the problem that arises is finding a discriminative manifold-embedded subspace. Based on DNE, the manifold structure has two important characteristics, namely intra-class absorption, the interaction between pairs of neighbors from the same class, and inter-class expulsion, the interaction between pairs of neighbors from different classes [15]. These two classes can be defined as follows (Figure 1) [15]:
(1) Intra-class absorption: the interaction between pairs of neighbors from the same class.
(2) Inter-class expulsion: the interaction between pairs of neighbors from different classes.
It is possible to characterize all data points by the absorbing or repelling behavior of these two interactions. Therefore, in the subspace, neighbors from the same class are absorbed, whereas neighbors from distinct classes become separable. In order to formulate the method, first consider a data point x_i; N_s(x_i) denotes the intra-class neighbors of x_i, N_d(x_i) the inter-class neighbors of x_i, and N(x_i) all the neighbors of x_i. Thus, to realize these two interactions, the edges between x_i and its inter-class and intra-class neighbors are assigned different weights, calculated using a kernel function based on the disparity between x_i and its nearby neighbors. The diacritical adjacent graph D can be obtained from the k-neighborhood. The diacritical adjacency weight matrix (DAWM) F is described as [15,20]:

F(i, j) = exp(−||x_i − x_j||²/t) if x_j ∈ N_s^k(x_i) or x_i ∈ N_s^k(x_j); F(i, j) = −exp(−||x_i − x_j||²/t) if x_j ∈ N_d^k(x_i) or x_i ∈ N_d^k(x_j); F(i, j) = 0 otherwise,

where t is the regulator, N_s^k(x_i) is the set of intra-class neighbors of x_i, and N_d^k(x_i) is the set of inter-class neighbors of x_i in a k-neighborhood.
It is obvious that different samples lead to different classification results. In view of the fact that the location of a sample in the feature space indicates its conditions, the regulator parameter t is defined to adjust the adjacent weight between pairs of neighbors. To obtain intra-class absorption and inter-class expulsion in the transformed space, a linear mapping is applied to project the input data points, so that in the new low-dimensional space:
(1) Intra-class absorption: x_j ∈ N_s^k(x_i) or x_i ∈ N_s^k(x_j), and c_i = c_j;
(2) Inter-class expulsion: x_j ∈ N_d^k(x_i) or x_i ∈ N_d^k(x_j), and c_i ≠ c_j.
Finally, the difference between the weighted distances from each data point x_i to its inter-class neighbors in N_d(x_i) and those from x_i to its intra-class neighbors in N_s(x_i) in the mapped space must be calculated, and by maximizing this measurement we obtain the optimum result. This measurement can be referred to as a margin. Thus, if the original data points are close together, this margin keeps the projected data points as close as possible; x_i and x_j are likewise prevented from being mapped far apart when they are close, by imposing a penalty [15].
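The kernel-weighted DAWM described above can be sketched as follows. This is an illustrative implementation based on the description (heat-kernel weights signed positively for intra-class k-neighbors and negatively for inter-class k-neighbors), not the authors' code; the exact weight form, the regulator t, and the neighborhood size k are assumptions.

```python
import numpy as np

def ldne_weights(X, labels, k=3, t=1.0):
    """Diacritical adjacency weight matrix F: +exp(-||xi-xj||^2/t) for
    intra-class neighbor pairs, -exp(-||xi-xj||^2/t) for inter-class
    neighbor pairs, 0 otherwise. Columns of X are samples."""
    n = X.shape[1]
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    F = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(sq[i])[1:k + 1]:          # k-neighborhood of x_i
            w = np.exp(-sq[i, j] / t)                 # kernel weight (regulator t)
            F[i, j] = w if labels[i] == labels[j] else -w
    # an edge exists if i is a neighbor of j or vice versa
    return np.where(np.abs(F) >= np.abs(F.T), F, F.T)

X = np.array([[0.0, 0.1, 2.0, 2.1],
              [0.0, 0.0, 0.0, 0.1]])                  # 4 samples in R^2
labels = np.array([0, 0, 1, 1])
F = ldne_weights(X, labels, k=2)
```

Unlike the ±1 weights of DNE, nearby intra-class neighbors receive large positive weights and nearby inter-class neighbors large negative weights, so the subsequent embedding emphasizes locality as well as discrimination.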

Main Idea
As already mentioned in Section 1, DNE cannot correctly preserve the local information of data because it only assigns +1 to intra-class and −1 to inter-class neighbors, so it might fail to find the most significant submanifold for pattern classification. In addition, LPP is designed based only on "locality," has no direct connection with classification, and still suffers from the "overlearning of locality" problem [17]. Therefore, LDNE has been proposed to overcome the problems existing in LPP and DNE, considering both "locality" and "discrimination" in a unified model. However, it does not guarantee an appropriate projection for classification purposes, because much important nonlinear structure might be lost during its dimensionality reduction process. In some cases, LDNE also cannot properly distinguish inter-class from intra-class neighbors when conducting the projection for all points, which can degrade the classification performance. To address these problems, we propose a new supervised subspace learning method named "supervised kernel locality-based discriminant neighborhood embedding" (SKLDNE). Combining nonlinear data structure, locality, and discrimination information, SKLDNE can yield an optimal subspace that best finds the indispensable submanifold-based structure.
In our proposed SKLDNE, we first use nonlinear kernel mapping to represent the input data in an implicit feature space F. Afterwards, a linear transformation is sought in the feature space that retains the within-class geometric structures. Hence, we achieve a nonlinear subspace that is able to estimate the essential geometric structure of the face manifold. In effect, the proposed SKLDNE models the nonlinear data in the feature space while important properties of the data, including "locality" and "discrimination," are simultaneously preserved. In order to clearly elucidate the performance of our SKLDNE, we have compared it with several dimensionality reduction techniques, including PCA, KPCA, LDA, UDP, LPP, DNE, and LDNE, on six different publicly available datasets.

Mathematics
Suppose X = [x_1, x_2, . . . , x_n] is a set of d-dimensional input samples. The input data are projected onto a higher-dimensional feature space F via a nonlinear mapping ∅: R^d → F, and manifold learning is then carried out on the projected samples ∅(X) = [∅(x_1), ∅(x_2), . . . , ∅(x_n)]. Now assume that we have to find a projection transformation V_∅ in F. The optimization problem can be expressed as

J(V_∅) = ∑_{i,j} ||z_i − z_j||² F(i, j) subject to V_∅^T V_∅ = I,

where I denotes the identity matrix, z_i = V_∅^T ∅(x_i) and z_j = V_∅^T ∅(x_j) are the projections of ∅(x_i) and ∅(x_j) onto V_∅, and F(i, j) represents the relationship between x_i and x_j. Rewriting the squared norm in the form of a trace, the objective becomes

J(V_∅) = 2 tr(V_∅^T ∅(X) L ∅(X)^T V_∅).

The linear transformation should lie in the span of ∅(x_1), ∅(x_2), . . . , ∅(x_n), i.e., V_∅ = ∅(X)α, where α = [α_1, α_2, . . . , α_n] collects the expansion coefficient vectors. Substituting this expansion into the trace form, we obtain

J(α) = 2 tr(α^T K L K α),

where D_ii = ∑_j F(i, j) is a diagonal matrix, L = D − F, and L and D are symmetric matrices. The optimization problem can thus be rewritten as

opt_α tr(α^T K L K α) subject to α^T K α = I,

where K is the kernel matrix with K(i, j) = k(x_i, x_j) = ∅(x_i)·∅(x_j). The corresponding generalized eigenvalue problem,

K L K α = λ K α,

is solved by calculating the maximum eigenvalues, where the generalized eigenvector corresponding to the biggest eigenvalue is of main interest. Finally, a new sample is mapped through the kernel dot products, and its nearest neighbor is found in the embedding space. It should be noted that, although SKLDNE and methods such as DNE both consider the geometrical information of intra-class and inter-class neighbors, the construction of their adjacency graphs and weights is done in very different ways.
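The kernelized eigenproblem above can be sketched numerically as follows. This is an illustrative implementation under the stated formulation (given a kernel matrix K and a signed weight matrix F, form L = D − F and solve K L K α = λ K α for the largest eigenvalues), not the authors' code; the regularization term and the random inputs in the usage example are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def skldne_embed(K, F, r=2):
    """Solve K L K alpha = lambda K alpha for the r largest eigenvalues
    and return the embedded coordinates z_i = (K alpha) rows."""
    D = np.diag(F.sum(axis=1))
    L = D - F                              # symmetric, since F is symmetric
    n = K.shape[0]
    A = K @ L @ K
    B = K + 1e-6 * np.eye(n)               # regularized constraint matrix
    vals, vecs = eigh(A, B)                # generalized eigenproblem, ascending
    alpha = vecs[:, ::-1][:, :r]           # eigenvectors of the r largest eigenvalues
    return K @ alpha                       # embedded coordinates, one row per sample

rng = np.random.default_rng(0)
G = rng.normal(size=(20, 6))
K = G @ G.T                                # a valid (positive semidefinite) kernel matrix
F = rng.normal(size=(20, 20))
F = (F + F.T) / 2                          # symmetric stand-in for the DAWM
Z = skldne_embed(K, F, r=3)
```

In practice K would come from a polynomial or Gaussian kernel over the training samples and F from the diacritical adjacency weight matrix of Section 2.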

Biometrics Application: Results and Analysis
In this section, to make a reliable and powerful comparison, the proposed SKLDNE technique was compared with PCA, KPCA, LDA, UDP, LPP, DNE, and LDNE through broad comparative experiments on six different publicly available datasets, i.e., the Sheffield, Yale face, ORL face, Head Pose, Finger Vein, and Finger Knuckle databases [15]. For each dataset, depending on the number of samples per class, some samples were randomly chosen for training, whereas the rest of the class was used for testing. Furthermore, in the recognition phase [15], for simplicity and in order to make the recognition outcome favorable, the nearest-neighbor classifier was utilized. The k-neighborhood size used for computing the weight matrix is denoted by Wk in the following. In all experiments, to achieve fair comparisons, Wk was set to Wk = Tn − 1 (where Tn is the number of training samples per class); based on our experiments, all the aforementioned methods achieved their optimal recognition rate with this value of Wk. It should be noted that two different kernels, polynomial and Gaussian, were considered. Based on our experimental results, the polynomial kernel performed better than the Gaussian one; therefore, only the polynomial kernel was applied in this research.
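For reference, the two kernel types mentioned above can be written as follows. This is an illustrative sketch with columns of X as samples; the degree and offset of the polynomial kernel are assumed defaults, not values reported in this study.

```python
import numpy as np

def polynomial_kernel(X, Y, degree=2, c=1.0):
    """Polynomial kernel k(x, y) = (x . y + c)^degree."""
    return (X.T @ Y + c) ** degree

def gaussian_kernel(X, Y, t=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / t)."""
    sq = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq / t)

X = np.eye(2)                        # two samples (columns): e1, e2
Kp = polynomial_kernel(X, X)         # -> [[4, 1], [1, 4]]
Kg = gaussian_kernel(X, X)           # unit diagonal, exp(-2) off-diagonal
```

Either function produces the kernel matrix K consumed by the optimization of Section 3.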

Experimental Results with the Sheffield Face Database
The Sheffield Face Database (SH.F) [51] includes a total of 564 images of 20 individuals, with each subject shown in poses ranging from profile to frontal views. All images are in PGM format, approximately 220 × 220 pixels, with 256 grey levels. Figure 2 illustrates a one-subject sample with different poses from the Sheffield multi-view face data. In our experiments, every image was resized to 112 × 92 pixels [52]. The maximal recognition rate of each method and the corresponding dimension on the Sheffield database are reported in Table 1, with the best result shown in bold. In addition, in all experiments, owing to the large number of runs, we selected training and testing splits that were more challenging for the classification task, in order to evaluate the performance of each technique in these critical cases. Thus, considering the small training sample size (SSS) problem, we first selected a small number of training samples and then some larger numbers to evaluate the performance of our SKLDNE method accurately. Tn = 1, 2, 3, 4, 5, 6, 7, 8, 15, 16, 17 were the training sample numbers selected from each class.
According to Table 1, three main conclusions can be drawn. First, SKLDNE significantly outperformed the other methods (PCA, KPCA, UDP, LPP, LDA, DNE, and LDNE) over an extensive range of dimensionalities for all numbers of training and testing images, whether the training sample size was large or small. When the training sample number was small, SKLDNE clearly behaved more efficiently than all other recognition techniques, which demonstrates the robustness of this method under the small training sample size (SSS) problem. Second, the recognition rates of all methods improved when more training samples were used. Third, as shown in Figure 3, SKLDNE provided the best results at the lowest dimensionality compared to its competitors. In addition, as the dimensionality increased to about 30, the recognition accuracy of each method first increased rapidly and then became stable. The differences between the recognition rate of SKLDNE and the other methods were pronounced when the training sample number was very small, and they persisted for larger training samples. For instance, for training numbers of 1, 2, and 3, SKLDNE achieved much better results than the others; moreover, for Tn = 17, SKLDNE reached a 100% recognition rate, while the results of the other methods were much lower. SKLDNE can effectively yield an optimal embedding subspace that finds the substantial submanifold-based structure with lower dimensionality. The within-class local structure, which is very important for face recognition, is preserved simultaneously in the nonlinear kernel feature space. SKLDNE can also solve the "out-of-sample" problem and the "overlearning of locality" problem in manifold learning, which the other aforementioned methods often face.
Therefore, based on these characteristics, it can be concluded that the proposed SKLDNE is a promising technique for dimensionality reduction, with very satisfactory classification performance when dealing with high-dimensional data.

Experimental Results with the Yale Face Database
The Yale face database [53] consists of 165 grayscale images of 15 people [54] taken under diverse facial expressions and lighting conditions. The data include 11 images for every individual, with various facial expressions or environmental conditions, such as: normal, with and without glasses, center-light, left and right light, happy, winking, sleepy, surprised, and sad. In the experiments, every image was first cropped and then resized to 32 × 32 pixels [55]. Figure 4a,b illustrates some sample images of one individual from the Yale database and the corresponding cropped images.

PCA was used for feature extraction in all methods; i.e., every method includes a PCA phase. The optimum recognition rate of each technique and the corresponding dimension on the Yale database are reported in Table 2. Furthermore, in all experiments, owing to the large number of runs, we selected training and testing images that were more challenging for the classification task, to clarify the performance of the aforementioned dimensionality reduction techniques in these critical cases. Considering the small training sample size (SSS) problem, we first selected a training number of 1 sample and then larger numbers from 6 to 9, to evaluate the performance of our SKLDNE method on common problems such as the SSS and out-of-sample problems.

Table 2 shows that the SKLDNE method achieved the highest accuracy in all implementations on the Yale database. In Figure 5a-e, the comparative classification accuracies are plotted for each given Tn (training number) by changing the dimensionality of the transformation matrix. As shown, the recognition rate of each technique increased promptly until the dimensionality reached almost 40, and then it stabilized. It can be observed in Table 2 that SKLDNE performed more efficiently than the others across a wide range of dimensionalities on the Yale face database. Meanwhile, the best performance of SKLDNE was achieved at a smaller dimension than LDNE for most training numbers. Moreover, the differences in classification between SKLDNE and the other methods are very clear, especially when the training number was small, for instance Tn = 1, 2. For training number 7, SKLDNE yielded an improvement of around 4.5% compared with DNE, LPP, LDA, and KPCA, and 6.6% in comparison with LDNE, UDP, and PCA. For training number 9, both SKLDNE and LDNE reached 100% accuracy, while the accuracies of the other methods with the same training number were much lower. To explain the superiority of the proposed method: our SKLDNE first maps the data into the kernel space to capture the substantial nonlinear structure, and then both the geometrical and the discriminant information of the data are captured, benefiting from a significant form of the affinity weight matrix to embed the graph. The fact that LPP, DNE, and LDNE outperform PCA, KPCA, and UDP demonstrates that methods based on discriminant information and local data structure are more suitable for face recognition; our SKLDNE has more nonlinear data representation, locality preservation, and discriminating power than the other methods, and consequently achieved the best recognition accuracy. Therefore, based on these characteristics, it can be concluded that our SKLDNE can overcome the "SSS," "out-of-sample," and "overlearning" problems.

Experimental Results with the ORL Database
The ORL face database [56] includes ten diverse grayscale images of each of 40 different individuals [57]. All 10 face images of each subject were captured at different times, with changes in the lighting, facial details (with or without glasses), or facial expressions (smiling/not smiling, open/closed eyes), against a dark homogeneous background and with straight, frontal views. The size of all images is 92 × 112 pixels in PGM format, organized in 40 directories [58]. It should be noted that preprocessing was applied: all original images were already cropped and resized. In this project, a size of 32 × 32 pixels was chosen for all ORL images. Figure 6 illustrates three different subjects (each with 10 images) from the ORL database.

Figure 6. Example of six different subjects (each with 4 images) from the ORL database [56].
In our experiments, Tn = 1, 2, 3, 4, 5, 6, 7, 8 training samples were chosen from each subject to form the training sample set, and the remaining images were used as the testing sample set. As already mentioned, PCA was used in the classification phase in all methods. The maximal average accuracy (in percentage terms) and its corresponding dimension for each training sample size are reported in Table 3. It can be observed in this table and in Figure 7a-f that SKLDNE generally outperformed LDNE, whether the training sample size was small or not, and usually at smaller numbers of dimensions. Moreover, as a supervised method, SKLDNE also significantly outperformed the other techniques (KPCA, LPP, DNE, UDP, and LDA) regardless of the training sample size, and it performed particularly well in the small training sample size case. Furthermore, when the training number was equal to 8, SKLDNE had a zero error rate, compared to PCA (4.1%), KPCA (3.5%), UDP (4%), LPP (3.5%), LDA (4%), DNE (3.75%), and LDNE (3.5%). SKLDNE can simultaneously discover interclass and intraclass geometrical information, and it offers more nonlinear data representation, locality preservation, and discriminating power than the other techniques. Therefore, SKLDNE has merit over other techniques for resolving classification problems in face recognition. This behavior in small sample size cases is particularly important for improving the recognition rate in practice, since face recognition is commonly a small sample size problem: normally, only a small number of images of each person is accessible in many real-world tasks.
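The evaluation protocol used above (random per-class splits, PCA projection, nearest-neighbor matching) can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the authors' original MATLAB code; all function names and the toy dimensions are our own.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA: return the mean and the top principal directions of X (n_samples x n_features)."""
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal axes
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T  # columns are components

def nn_classify(train_Z, train_y, test_Z):
    """1-NN classifier with Euclidean distance in the projected space."""
    d = ((test_Z[:, None, :] - train_Z[None, :, :]) ** 2).sum(-1)
    return train_y[d.argmin(axis=1)]

def split_per_class(y, tn, rng):
    """Pick tn random training indices per class; the rest form the test set."""
    tr, te = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])
        tr.extend(idx[:tn]); te.extend(idx[tn:])
    return np.array(tr), np.array(te)

rng = np.random.default_rng(0)
# Synthetic stand-in: 5 "subjects" x 10 "images" of dimension 64 (a real 32x32 image is 1024-d)
y = np.repeat(np.arange(5), 10)
X = rng.normal(size=(50, 64)) + 3.0 * rng.normal(size=(5, 64))[y]

tr, te = split_per_class(y, tn=5, rng=rng)
mean, W = pca_fit(X[tr], n_components=10)
pred = nn_classify((X[tr] - mean) @ W, y[tr], (X[te] - mean) @ W)
acc = (pred == y[te]).mean()
```

The same harness applies to any of the compared projections: only the matrix W changes from method to method, while the split and the NN matching stay fixed.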

Experimental Results with the Head Pose Database
The Head Pose database [59] includes 2790 face images taken from 15 individuals, with pan and tilt variation angles from −90 to +90 degrees. For each individual, 2 sets with 93 different positions were captured [60] (Figure 8).
Figure 8. A subset of images of one subject from the Head Pose database [59].
As can be observed in Table 4, SKLDNE performed better than the other seven methods, regardless of the variation in the training sample size. The maximal recognition rate of SKLDNE when Tn = 130 was up to 99.28%, while those of the other methods were much lower. This reveals that, as the training sample size for each class grows, SKLDNE obtains much better results than the other methods. Two more points can also be outlined. First, our supervised method with kernel weighting notably enhances classification performance, whereas applying the kernel trick has no significant influence on PCA performance. Second, SKLDNE achieves its optimal recognition rates at smaller numbers of dimensions, retaining the best results as the dimension varies from 14 to 30. Compared to the other techniques, SKLDNE preserves more discriminative and local features of face images, and it preserves more of the local geometric relations of the within-class samples through nonlinear kernel mapping. It should be noted that linear methods such as LPP, LDA, LDNE, DNE, and UDP often fail to deliver good classification performance when face images are subject to complex nonlinear changes such as expression, lighting, and pose. Figure 9 indicates that the recognition rates of all methods first increase sharply as projected dimensions are added, and then, after reaching their optimum, tend to become stable.
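The nonlinear kernel mapping invoked above can be made concrete with a Gaussian (RBF) Gram matrix. The following is a generic sketch of the kernel computation and the feature-space centering used by kernel methods such as KPCA, not the paper's exact formulation; the value of gamma is arbitrary and chosen only for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1e-3):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

def center_kernel(K):
    """Center the Gram matrix in feature space: K' = H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1024))          # 20 stand-in images, 32x32 flattened
K = center_kernel(rbf_kernel(X, X))
# A centered Gram matrix is symmetric and its rows sum to (numerically) zero
```

An eigendecomposition of this centered K is what yields the nonlinear embedding; the linear methods discussed in the text correspond to working with the raw data matrix instead.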

Experimental Results with the Finger Vein Database
The finger vein database used in this project was collected from 51 individuals (male and female) aged between 21 and 56 [61]. Ten images were captured from each subject, using four fingers: the right and left middle fingers and the right and left index fingers. There are 204 different fingers in the database and 2040 images in total, where each finger image originally had a dimension of 480 × 160 pixels. In our implementations, each image was resized to 32 × 32. The captured images from one person can be seen in Figure 10. In this section, the performance of each technique, measured by its maximal recognition rate, was evaluated by changing the dimensionality of the transformation matrix. Tn = 1, 2, 3, 4, 5, 6, 7 training samples were chosen from the image gallery to form the training sample set, and the rest of the images were used for testing [62]. Furthermore, the nearest neighbor (NN) classifier using the Euclidean distance was applied in the recognition phase, and the best recognition rates of each technique are reported in Table 5.
Based on the experimental results shown in Figure 11, our SKLDNE attained the best recognition rate for all training numbers, which shows encouraging performance compared to the other advanced methods. In the small training sample size case, SKLDNE again performed significantly better than the other techniques: the maximal recognition accuracy of SKLDNE at training number 2 was almost 5% higher than those of PCA, KPCA, UDP, LPP, and DNE, and 1.2% higher than that of LDNE. SKLDNE was always able to represent its optimal embedding space with a smaller number of dimensions than the other seven techniques; for example, for Tn = 7, SKLDNE achieved 100% recognition accuracy at the smallest number of projected dimensions (26). This conveys that our approach is more effective because it can not only represent nonlinear and complex variations of images, but can also model both the locality of LPP and the discrimination of DNE simultaneously.
Figure 11. (a-g) Comparative recognition results obtained by changing the dimensionality of the transformation matrix for each given training number Tn (Finger Vein database).

Experimental Results with the Finger Knuckle Print Database
This database was organized by the Hong Kong Polytechnic University [63] and is freely available online. According to the database description, finger knuckle print (FKP) images were collected from 165 volunteers (males and females) [64]. The samples were collected in two distinct sessions, and in each session 6 images were captured from each of 4 fingers (the left and right index fingers and the left and right middle fingers). Therefore, 7920 finger images in total were taken from 660 different fingers. In our experiments, the original images were cropped and then resized to 32 × 32 pixels. Figure 12 illustrates a cropped sample of the FKP database. Table 6 shows that the recognition performance of SKLDNE was significantly better than that of the other techniques on the Finger Knuckle database, regardless of the variation in the training sample size. The recognition rate of SKLDNE when Tn = 11 was 100%, while that of LDNE was 97.2%. Another point worth mentioning concerns the small training sample size case, where SKLDNE again had the best performance. Moreover, the best recognition performances of SKLDNE were almost all achieved at smaller numbers of dimensions for every Tn. It can also be observed that, as the training sample size of each class became larger, SKLDNE achieved much better results than the other techniques; for example, for Tn = 6, the accuracy of SKLDNE was around 8% and 21% higher than those of LDNE and PCA, respectively. To explain the superiority of SKLDNE, it should be noted that it preserves more effective nonlinear features and more geometrical and discriminant information, so it can tackle the small sample size, out-of-sample, and "overlearning of locality" problems.
Hence, it can be concluded that the SKLDNE approach is a promising supervised technique with satisfactory classification performance for application to the Finger Knuckle database. The recognition rates versus the variation of dimensions for each Tn are shown in Figure 13a-i.

Classification Performance and Computational Cost
In this section, by means of a new group of experiments, we evaluated the performance of SKLDNE while varying the k-neighborhood parameter Wk (from 1 to 30 in steps of 2). The numbers of training samples selected for each database were Tn = 4, 5, 6, 7, 8, 15, 16, 17 in Sheffield; Tn = 2, 6, 7, 8, 9 in Yale; Tn = 5, 6, 7, 8, 9 in ORL; Tn = 120, 130, 140, 150, 160 in Head Pose; Tn = 5, 6, 7, 8, 9 in Finger Vein; and Tn = 7, 8, 9, 10, 11 in Finger Knuckle. The rest of each class in each dataset was used for testing, and the training samples were selected randomly. The maximum recognition rates of SKLDNE versus Wk for different numbers of training samples are shown in Figure 14. It can be seen that the classification performance improved as the number of training samples increased. The classification performance of SKLDNE first improved with a rise of Wk until about Wk = 9 in Sheffield and Head Pose, and then decreased dramatically. In the Yale, ORL, Finger Vein, and Finger Knuckle datasets, the recognition performance of SKLDNE improved rapidly as Wk varied from 1 to 7 in Yale, 1 to 10 in ORL, 1 to 9 in Finger Vein, and 1 to 6 in Finger Knuckle, and then decreased as Wk became larger, since large values of the k-neighborhood parameter Wk affect the construction of the adjacency weight matrix. It has already been shown that, when a dataset includes many classes with a small number of samples per class, the k-neighborhood selected for a data point may contain more outliers belonging to other classes at large values of Wk. Thus, the constructed adjacency weight matrix does not have sufficient discrimination for image recognition [15].
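The effect described above can be demonstrated with a simplified, DNE-style signed adjacency matrix. This sketch is our own unsymmetrized illustration (published methods typically symmetrize the matrix); it shows how, once Wk exceeds the per-class sample count, every neighborhood is forced to include other-class points.

```python
import numpy as np

def signed_adjacency(X, y, wk):
    """Simplified DNE-style weights over each point's wk nearest neighbors:
    +1 for a same-class neighbor, -1 for a different-class neighbor, 0 otherwise."""
    n = len(y)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    F = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d[i])[:wk]:
            F[i, j] = 1.0 if y[i] == y[j] else -1.0
    return F

rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 5)                           # 3 classes, 5 samples each
X = rng.normal(size=(15, 10)) + 4.0 * rng.normal(size=(3, 10))[y]

frac_inter_small = (signed_adjacency(X, y, wk=3) == -1).mean()
frac_inter_large = (signed_adjacency(X, y, wk=12) == -1).mean()
# With wk = 12 but only 4 same-class neighbors available per point, at least 8
# neighbors per point come from other classes, diluting the local class structure
```

This is exactly the regime in which the constructed weight matrix loses discrimination, matching the drop in recognition rate observed at large Wk.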
The experiment for analyzing computational cost was carried out on an Intel(R) Core i5-4200U CPU at 2.3 GHz with 10 GB RAM, using MATLAB R2016 (MathWorks, Natick, MA, USA). The computational costs of the different classification methods on the Yale database are listed in Table 7. It can be observed that the proposed SKLDNE was faster than its main competitors, such as LDNE, DNE, and LPP. The processing times of PCA, KPCA, UDP, and LDA were lower; however, the recognition results illustrate that these methods were much less accurate than SKLDNE. To provide a more reliable comparison, we also briefly compare the recognition results of the proposed method with previously published works, including a deep learning method named Deep Belief Networks (DBNs) [65,66] and a traditional multilayer perceptron model (MLP) [66], on a widely used facial expression database, namely the JAFFE database [67]. As a deep learning method, DBNs have an unsupervised feature learning ability. The JAFFE database includes 10 individuals (Japanese women) with 7 different expressions and around 3 or 4 images per expression, giving 213 images in total, each with a resolution of 256 × 256 pixels. In detail, we divided all image samples into 10 parts, 90% of which were used for training, with the remainder used for testing. Table 8 presents the recognition performance comparisons on the JAFFE database for three different image resolutions: 16 × 16, 32 × 32, and 64 × 64. We can clearly see that the proposed SKLDNE method achieved the best recognition performance (100% in all cases), while other previously reported results [66] are much lower. This is attributed to the main characteristics of SKLDNE, which effectively represents more of the nonlinear data structure and has more locality- and discrimination-preserving power.
The results again show the robustness of SKLDNE for facial expression recognition.
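As a rough analogue of the cost measurements in Table 7, the wall-clock time of a method's training step can be measured with a best-of-n harness like the one below. This is a sketch in Python rather than the MATLAB used for the reported numbers, and the data size is a stand-in, not the actual Yale images.

```python
import time
import numpy as np

def time_fit(fit_fn, X, repeats=3):
    """Best-of-n wall-clock time of a training routine, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fit_fn(X)
        best = min(best, time.perf_counter() - t0)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(165, 1024))   # stand-in gallery: 165 samples of 32x32 pixels

# Example: time a PCA-style fit (SVD of the centered data matrix)
pca_seconds = time_fit(lambda A: np.linalg.svd(A - A.mean(0), full_matrices=False), X)
```

Taking the best of several repeats reduces interference from other processes, which is why minimum rather than mean time is reported here.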

Conclusions and Future Research
In this study, the performance of several well-known pattern recognition strategies was analyzed to clarify which techniques are best suited to face recognition, and the weaknesses and strengths of each technique were examined. As already mentioned, DNE cannot properly preserve the local information of the data because it only assigns +1 to intra-class and −1 to inter-class neighbors, so it might fail to discover the most significant submanifolds for pattern classification. LPP is designed based on "locality"; it has no direct connection with classification and still suffers from the "overlearning of locality" problem. LDNE was proposed to overcome the problems of LPP and DNE; however, it does not guarantee an appropriate projection for classification purposes, because much important nonlinear structure might be lost during its dimensionality reduction process. In addition, in some cases, LDNE cannot distinguish inter-class and intra-class neighbors well enough to conduct the projection for all points, which can degrade classification performance. In order to address these problems, we have proposed a new supervised subspace learning algorithm named "supervised kernel locality-based discriminant neighborhood embedding" (SKLDNE). By combining nonlinear data structures, locality, and discrimination information, SKLDNE yields an optimal subspace that best captures the essential submanifold-based structure. Six publicly available datasets, i.e., Yale face, ORL face, Sheffield, Head Pose, Finger Vein, and Finger Knuckle, were used to illustrate the significance of the proposed technique. Experimental results reveal that SKLDNE outperforms other advanced dimensionality reduction methods, obtaining the highest recognition rates in all experiments, and demonstrates strong potential for implementation in real-world systems.
Its ability to represent complex nonlinear variations makes SKLDNE more powerful and more intuitive than LDNE and the other aforementioned techniques in terms of classification. According to the results, SKLDNE can also alleviate the "small training sample size" problem, and it achieved the best performance at smaller numbers of projected dimensions for every number of training samples per dataset. Moreover, as the training sample size for each class grew larger, SKLDNE achieved much better results than the other techniques. The overlearning of locality problem and the out-of-sample problem in manifold learning can be avoided by applying our developed classifier. As future work, we will modify this classifier to make it directly applicable to two-dimensional data, in order to effectively reduce computational costs.