Uniform Generic Representation for Single Sample Face Recognition

In this article, we propose a uniform generic representation (UGR) method to solve the single sample per person (SSPP) problem in face recognition, which aims to find consistency between the global and local generic representations. For the local generic representation, we require the probe patches of the same image to be constructed respectively by the corresponding patches of the same gallery image and the intra-class variation dictionaries. Therefore, the probe patches’ coefficients, corresponding to patch gallery dictionaries, should be similar to each other. For the global generic representation, the probe image’s coefficient, corresponding to the gallery dictionary, should be similar to those of its probe patches. In order to meet the two requirements, we combine local generic representation with global generic representation in soft form. We obtain the representation coefficients by solving a simple quadratic optimization problem. UGR has been evaluated on Extended Yale B, AR, CMU-PIE, and LFW databases. Experimental results show the robustness and effectiveness of our method to illumination, expression, occlusion, time variation, and pose.


I. INTRODUCTION
In the past few decades, face recognition (FR) has been one of the most popular research topics in computer vision and pattern recognition due to its wide range of potential applications. Researchers have carried out extensive research on FR and made remarkable progress [1]. FR technology under controllable conditions has achieved satisfactory results. However, under uncontrollable circumstances, FR is still challenging due to the influence of illumination, expression, posture, occlusion, and other factors. The direct way to solve these problems is to increase the number of training samples. But in some real scenes, such as access control, e-passport, identity card verification, judicial confirmation, etc., usually only one training sample can be obtained, which is the so-called single sample per person (SSPP) problem [2]. In that case, though some famous FR methods [3]-[5] can still be applied, they suffer from a serious performance drop. And linear discriminant analysis
(LDA)-based methods [6], [7] even fail to work since they are all designed assuming that multiple samples per person are available. The absence of facial variation information makes it more challenging to fulfill the FR task.
Recently, many specially designed FR methods have been proposed to solve the SSPP problem, such as the singular value decomposition (SVD)-based perturbation algorithms [8], [9] and the randomized sampling-based algorithm [10]. These methods generate multiple virtual images from the original image and then extract features by applying LDA, which makes the learned features redundant because substantial correlation exists among the generated images. Other methods, such as discriminative multi-manifold analysis (DMMA) [11], sparse discriminative multi-manifold embedding (SDMME) [12], and robust heterogeneous discriminative analysis (RHDA) [13], obtain multiple samples for each subject by dividing each gallery image into a collection of patches. They treat each subject as a manifold and learn multiple discriminative feature spaces. These methods work well only when the probe image does not contain extreme variation compared with the gallery image, which is unrealistic in practice. All of the above approaches share the same problem of being unable to handle facial variations.
To deal with the lack of intra-class variation information, an additional generic training dataset is introduced. Deng et al. [14] proposed a prototype plus variation (P + V) model to cope with the large variation between the probe and gallery images. Zhu et al. [15] proposed a local generic representation (LGR) that uses correntropy to minimize the loss, since they found that the representation residuals are non-Gaussian distributed. Gao et al. [16] presented the idea that the patches of a probe image should be constructed by the patches of the corresponding gallery image; accordingly, they proposed a regularized patch-based representation (RPR) by imposing a group sparsity constraint on the reconstruction coefficients. The original variation dictionary built from the generic training dataset is usually very large, which results in expensive computational cost when solving the optimization problem. Therefore, the authors of [16]-[19] each proposed to learn a compact variation dictionary. Benefiting from the external variation information, these generic learning-based methods boost the performance of FR with SSPP significantly. However, these methods only focus on a single scale of the image. In fact, images or patches at different scales can provide complementary information, which is neglected by existing methods.
In this article, we propose a novel method to solve the SSPP problem, named uniform generic representation (UGR), which aims to unify the local representation and the global representation. As in the common local generic representation, each image is divided into overlapped patches, and each probe patch is represented by the patch gallery dictionary and the patch variation dictionary at the corresponding location. Ideally, the probe patches of the same image should be constructed by the patches from the corresponding gallery image, i.e., the representation coefficients of the probe patches of the same image, corresponding to the patch gallery dictionary, should be equal to each other, no matter what the size of the patch is. And from the view of the global generic representation, the representation coefficient of the probe image, corresponding to the gallery dictionary, should be equal to those of the patches of the same probe image in the local-representation case, which is illustrated in Fig. 1. Given this, we propose solving the optimization problems of the local and global representations together to find the uniform coefficient. However, in reality, the constraint of finding the uniform coefficient is so strong that non-discriminative or corrupted patches may mislead the discriminative ones. Therefore, we relax the constraint by making the representation coefficients of the patches close to the uniform coefficient instead of equal to it. The proposed representation can be regarded as a soft combination of the local and global representations. In our method, both the local and global representations can give full play to their advantages, and the consistency between them can be explored. After solving the optimization problem, the strategy of majority voting is adopted, i.e., we classify each patch of an image and then decide the image's label by the voting of all the patches' classification results. Experiments on four public face databases are performed to demonstrate the effectiveness of the proposed method.

FIGURE 1. Illustration of the uniform coefficient. Regardless of the intra-class variations, the probe image and its probe patches should be constructed respectively by the corresponding gallery image and its gallery patches, i.e., ideally, z = z^i. Even if z ≠ z^i, they should be similar.
Our contributions are summarized as follows.
1) Global representation and local representation have their own advantages. The global representation has good robustness for recognizing those indiscriminative regions such as forehead and cheek, while the local representation can mitigate the effect of facial variations. In this article, we explore how to use the global representation to guide the local representation. We combine the local generic representation with the global generic representation in a soft form. The consistency between the local and global generic representations is explored for the first time. The soft combination also maintains local diversity.
2) Existing methods for dealing with the SSPP problem extract either local or global features. In our method, we make use of both local and global features. We evaluate our method from multiple perspectives.
We organize the rest of this article as follows. In Section II, we introduce the related work. In Section III, we describe the proposed uniform generic representation. Section IV presents the experimental results and discussion. We conclude the paper in Section V.

II. RELATED WORK
When designing an FR system with SSPP, we must deal with three key issues: 1) The dimension of the data is much larger than the number of training samples. 2) The within-class scatter matrix cannot be calculated, and the obtained between-class scatter matrix becomes inaccurate.
3) It is difficult to describe the probe sample with unknown variations.
For the first issue, some scholars increase the number of training samples by generating extra ones, which can also alleviate the second issue. The authors of [8] and [9] generate multiple training samples for each person by applying SVD-based perturbation. Li et al. [10] proposed to get multiple sub-images from a single sample by randomized down-sampling. These methods of generating virtual samples gain no information beyond the original image. Some scholars instead apply feature extraction to decrease the dimension of the image while keeping discriminative information. Wu and Zhou [20] proposed a projection-combined principal component analysis ((PC)^2A) method, and Chen et al. [21] further proposed an enhanced projection-combined principal component analysis (E(PC)^2A) method. These two methods only focus on mining more global information. In [22], the authors proposed to enhance the recognition ability by using local descriptors (Gabor wavelets [23] and local binary patterns [24]). Zhou et al. [25] and Liu et al. [26] presented two bag-of-words feature-based methods to address the SSPP problem. The three methods [22], [25], [26] show that local descriptors increase robustness to facial variation. Other scholars proposed to decrease the dimension of the data by image partition. The methods using image partition are called patch-based methods, such as patch-based sparse representation for classification (PSRC) [5], patch-based collaborative representation for classification (PCRC) [27], and the patch-based nearest neighbor classifier (PNN) [28]. In addition, Liu et al. [29] proposed to build a subspace for each patch and then explore the relationships between subspaces. Patch-based methods utilize local information, and most of them predict the image's label by majority voting. Therefore, patch-based methods can mitigate the effect of facial variations.
For the second issue, one approach is to generate extra training samples for each individual, as discussed above. Another approach is to divide each gallery image into patches and treat them as training samples of each class. Based on this conception, Chen et al. [30] proposed a block-based Fisher linear discriminant analysis (BlockFLDA) method. Lu et al. [11] treated each person as a manifold and formulated FR as a multi-manifold matching problem. Yan et al. [31] further proposed a multi-feature multi-manifold learning method. Zhang et al. [12] proposed to learn discriminative features by embedding two sparse graphs into multi-manifold learning. Combining the advantages of [11] and [12], Pang et al. [13] proposed to conduct discriminant analysis by introducing a graph-based Fisher-like criterion. Besides these two approaches, there is a third way of dealing with the second issue: Chu et al. [32] proposed to get extra training samples by mirroring the right half of a face to the left and to perform feature extraction at the half-face level.
For the third issue, the usual solution is to introduce an additional generic training dataset since the facial variations of different people share much similarity. For example, Deng et al. [14] proposed an extended SRC (ESRC) method by coding the difference between the gallery and probe samples with a generic training dataset. Su et al. [2] proposed a generic discriminant model to infer the variation information of the gallery dataset. Yang et al. [33] proposed to learn a sparse variation dictionary, which is adaptive to the gallery dataset. Ji et al. [34] proposed to construct an adaptive probabilistic label matrix for each probe sample. Deng et al. [22] proposed to embed equidistant prototypes into the linear regression model and map the intra-class differences calculated from the generic training dataset to zero vectors. The above five methods get the variation information in five different ways and have good robustness for recognizing those indiscriminative regions such as forehead and cheek as they all treat the entire image as a feature vector. There are also other methods that incorporate the generic variation information into patch-based methods, such as LGR [15] and RPR [16]. They can achieve better performance by taking advantage of the generic training dataset and image partition. Besides introducing a generic training dataset, Li et al. [35] proposed to capture the variation information by preserving the variations of each probe sample onto training samples.
It is generally known that deep learning-based methods have achieved great breakthroughs in the FR field. In view of this success, some researchers have proposed to utilize deep models to deal with the SSPP problem. For instance, Gao et al. [36] proposed a stacked supervised auto-encoder (SSAE) by enforcing faces with variations to be mapped to the corresponding canonical face. AbdElmaksoud et al. [37] proposed to augment the training data by synthesizing 3D faces from 2D face photos and then train different networks to recognize faces. Tu et al. [38] proposed a 3D face modeling module to extract invariant features and generate 2D images with different variations. Hong et al. [39] proposed to jointly perform feature extraction, domain adaptation, and classification by applying a deep architecture with domain-adversarial training. Yang et al. [40] proposed to use convolutional neural networks (CNNs) to extract local adaptive features. Zeng et al. [41] proposed to utilize generated virtual samples to fine-tune a well-trained deep CNN model. Deep learning-based methods can dramatically improve the performance on the LFW (Labeled Faces in the Wild) database. However, they usually require tremendous auxiliary data to train the neural network, which leads to high computational cost, and their parameters are not easy to tune.

III. UNIFORM GENERIC REPRESENTATION
In this section, we will first introduce the global and local generic representations and then propose the uniform generic representation. We also give the solution of the proposed optimization problem.

A. GLOBAL GENERIC REPRESENTATION
Under the SSPP condition, like [14], we assume that the facial intra-class variations of different people are similar, and the intra-class variant basis can be acquired from a sufficient number of generic faces. Based on this assumption, we use an additional generic training dataset to construct the intra-class variation dictionary. For clarity, we denote the single neutral images in the generic dataset by G = [g_1, · · · , g_L] ∈ R^{d×L}, where g_i corresponds to the neutral image from the i-th subject, and we denote the variation images by G^t, where the superscript t represents the type of variation. G^t is a counterpart of G with the variation type t. Suppose that there are M types of variations in the generic dataset; then we construct the intra-class variation dictionary D by using the difference between G^t and G:

D = [G^1 − G, G^2 − G, · · · , G^M − G] ∈ R^{d×LM}.  (1)

We are given a gallery dataset {a_1, · · · , a_K} with K classes, where a_k is the only gallery image of the k-th subject. The gallery dataset is used as the prototype dictionary, denoted by A = [a_1, · · · , a_K] ∈ R^{d×K}. Then, a probe image x can be reconstructed by utilizing the prototype dictionary A and the intra-class variation dictionary D, which is formulated as follows:

x = Az + Ds + e,  (2)

where z ∈ R^K and s ∈ R^{LM} are the reconstruction coefficients, and e is the residual. For N probe images X = [x_1, · · · , x_N] ∈ R^{d×N}, (2) can be rewritten as follows:

X = AZ + DS + E,  (3)

where Z ∈ R^{K×N} and S ∈ R^{LM×N} are the reconstruction coefficient matrices, and E is the residual matrix. The reconstruction coefficients can be obtained by solving the following optimization problem:

min_{Z,S} ||X − AZ − DS||_F^2 + λ_1 ||Z||_F^2 + β_1 ||S||_F^2,  (4)

where ||·||_F denotes the Frobenius norm of a matrix, and λ_1 > 0 and β_1 > 0 are the weighting parameters.
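As a concrete illustration, the ridge-regression problem above can be solved in closed form via the normal equations of the stacked dictionary [A D]. The sketch below is a minimal toy version: random matrices stand in for face images, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy sizes: pixels d, gallery subjects K, generic subjects L,
# variation types M, probe images N
d, K, L, M, N = 900, 30, 8, 4, 5

A = rng.standard_normal((d, K))                 # prototype dictionary
G = rng.standard_normal((d, L))                 # neutral generic images
G_t = [G + 0.1 * rng.standard_normal((d, L)) for _ in range(M)]

# Intra-class variation dictionary: differences between each variation
# type and the neutral generic images, stacked column-wise.
D = np.hstack([Gt - G for Gt in G_t])           # d x LM

X = rng.standard_normal((d, N))                 # probe images

# Solve min ||X - AZ - DS||_F^2 + lam*||Z||_F^2 + beta*||S||_F^2
# through the normal equations of the stacked dictionary B = [A D].
lam, beta = 3.0, 0.01
B = np.hstack([A, D])
reg = np.diag(np.concatenate([np.full(K, lam), np.full(L * M, beta)]))
W = np.linalg.solve(B.T @ B + reg, B.T @ X)     # (K+LM) x N coefficients
Z, S = W[:K], W[K:]
```

Because the objective is an ordinary (block-weighted) ridge regression, a single linear solve recovers both coefficient matrices at once.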

B. LOCAL GENERIC REPRESENTATION
The local generic representation is a patch-based version of the global generic representation. We divide the prototype dictionary A into C overlapped patches, denoted by {A^1, · · · , A^C}, where the superscript indicates the serial number of the patch. The probe images X and the intra-class variation dictionary D are divided respectively into {X^1, · · · , X^C} and {D^1, · · · , D^C} in the same way. With A^i and D^i, X^i can be represented as:

X^i = A^i Z^i + D^i S^i + E^i,  (5)

where Z^i ∈ R^{K×N} and S^i ∈ R^{LM×N} are the reconstruction coefficients of X^i over A^i and D^i, respectively, and E^i is the residual matrix. We can obtain the reconstruction coefficients by solving the following optimization problem:

min_{Z^i,S^i} ||X^i − A^i Z^i − D^i S^i||_F^2 + λ_2 ||Z^i||_F^2 + β_2 ||S^i||_F^2,  (6)

where λ_2 > 0 and β_2 > 0 are the weighting parameters.
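A minimal sketch of the patch division and the resulting per-patch ridge problems, assuming row-major vectorized images; the helper `patch_columns`, the toy sizes, and the random data are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def patch_columns(M_img, img_hw, patch=20, stride=10):
    """Split each column (a vectorized img_hw image) of M_img into
    overlapping vectorized patches; returns a list of C matrices."""
    h, w = img_hw
    imgs = M_img.reshape(h, w, -1)
    out = []
    for r in range(0, h - patch + 1, stride):
        for c in range(0, w - patch + 1, stride):
            out.append(imgs[r:r + patch, c:c + patch, :].reshape(patch * patch, -1))
    return out

rng = np.random.default_rng(1)
d_img, K, V, N = (80, 80), 6, 10, 3
A = rng.standard_normal((6400, K))      # gallery dictionary (one image per subject)
D = rng.standard_normal((6400, V))      # intra-class variation dictionary
X = rng.standard_normal((6400, N))      # probe images

A_p, D_p, X_p = (patch_columns(m, d_img) for m in (A, D, X))
lam2, beta2 = 3.0, 0.01

# The local objective decouples over patches: each X^i gets its own ridge problem.
Z_list, S_list = [], []
for Ai, Di, Xi in zip(A_p, D_p, X_p):
    B = np.hstack([Ai, Di])
    reg = np.diag(np.r_[np.full(K, lam2), np.full(V, beta2)])
    W = np.linalg.solve(B.T @ B + reg, B.T @ Xi)
    Z_list.append(W[:K]); S_list.append(W[K:])
```

With 80 × 80 images, 20 × 20 patches, and stride 10, each image yields 7 × 7 = 49 patches, so the loop runs 49 independent small solves.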

C. UNIFORM GENERIC REPRESENTATION
In Sections III-A and III-B, we introduced the global and local generic representations. The global representation has good robustness for recognizing those indiscriminative regions such as forehead and cheek, while the local representation can mitigate the effect of facial variations. We aim to integrate the advantages of both the global representation and the local representation. From the perspective of the local representation, regardless of the variation dictionary, the patches of the same probe image should be constructed respectively by the corresponding patches from the right gallery image, i.e., ideally,

z_j^1 = z_j^2 = · · · = z_j^C.

From the perspective of the global representation, the probe image should be constructed by the corresponding gallery image, which means that

z_j = z_j^1 = z_j^2 = · · · = z_j^C = [0, 0, · · · , 1, · · · , 0]^T.

The above idea can be formulated as the following representation:

X = AZ + DS + E,
X^i = A^i Z + D^i S^i + E^i, i = 1, · · · , C.  (7)

However, in practice, some face patches like cheek and forehead are not very discriminative and thus may be constructed by the wrong gallery patches. And even if the discriminative patches are constructed mainly by the corresponding gallery patches, the values of the representation coefficients z_j^i (i = 1, · · · , C) are not necessarily equal to each other. Considering these two points, the constraint of directly finding the uniform coefficient z_j is unfit. Therefore, we relax the constraint by making the representation coefficients of the probe patches close to the uniform coefficient instead of equal to it. The proposed representation is as follows:

X = AZ + DS + E,
X^i = A^i (Z + H^i) + D^i S^i + E^i, i = 1, · · · , C.  (8)

Z + H^i will be close to Z as long as we make H^i small. We can get the reconstruction coefficients by solving the following optimization problem:

min_{Z,S,{H^i},{S^i}} ||X − AZ − DS||_F^2 + Σ_{i=1}^{C} ||X^i − A^i(Z + H^i) − D^i S^i||_F^2 + λ||Z||_F^2 + β(||S||_F^2 + Σ_{i=1}^{C} ||S^i||_F^2) + γ Σ_{i=1}^{C} ||H^i||_F^2,  (9)

where Z, H^i ∈ R^{K×N}, S, S^i ∈ R^{LM×N}, and λ > 0, β > 0, and γ > 0 are the weighting parameters. The regularization term γ||H^i||_F^2 balances the relationship between the global and local representations.
The proposed representation can be seen as the combination of (3) and (5) since the obtained Z satisfies (3) while the obtained Z + H i satisfies (5), which means that the proposed representation can ensure the global consistency as well as the local diversity. In our proposed method, we can make use of both global and local features. The available information for our proposed representation is more than that for (3) or (5).
To improve the robustness of the representation and increase the classification power, we extract the eight nearest neighbor patches of each probe and gallery patch. For a patch on the margin of an image, we first mirror the image at the border and then extract its neighbors. The extracted neighbors also help to cope with local deformation (e.g., misalignment). Corresponding to the local representation, we also extract the eight neighbors of the probe and gallery samples for the global representation.
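One plausible implementation of the neighbor-patch extraction with mirror padding is sketched below. It assumes the eight neighbors are the patch window shifted by one pixel in each direction; the shift amount and the helper name `neighbor_patches` are our assumptions, since the text only specifies "eight nearest neighbor patches" and a mirror transform at the margin.

```python
import numpy as np

def neighbor_patches(img, top, left, patch=20):
    """Return the centre patch plus its eight one-pixel-shifted neighbours.
    The image is mirror-padded so neighbours exist at the margins."""
    pad = np.pad(img, 1, mode='reflect')          # mirror transform at borders
    shifts = [(0, 0)] + [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                         if (dr, dc) != (0, 0)]
    # original (top, left) maps to (top+1, left+1) in the padded image
    return [pad[top + 1 + dr: top + 1 + dr + patch,
                left + 1 + dc: left + 1 + dc + patch].ravel()
            for dr, dc in shifts]

rng = np.random.default_rng(2)
img = rng.standard_normal((80, 80))
ps = neighbor_patches(img, top=0, left=0)   # corner patch: padding supplies neighbours
```

The first element of the returned list is always the unshifted center patch; the remaining eight are its neighbors.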
Since images should be represented by the base elements of the class that they belong to, we propose a simple classification method. The classification result of the i-th patch of probe sample j is determined by

identity(x_j^i) = arg min_k Σ_{p=0}^{8} ||x_j^{i,p} − A^{i,p} δ_k(z_j + h_j^i) − D^{i,p} s_j^{i,p}||_2,  (10)

where δ_k : R^n → R^n is a map function that keeps the coefficients associated with the k-th class and sets the others to zero, p = 0 indicates the center patch, and p = 1, · · · , 8 indicates the eight neighbors of the center patch. The final classification result of probe sample j is then obtained by the voting of all the patches' classification results.
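A toy sketch of the residual-based patch decision and the majority voting described above. For brevity it omits the variation-dictionary term and the neighbor patches, keeping only the class-wise residual comparison; `classify_patch` and `classify_image` are illustrative names, and the data are random stand-ins.

```python
import numpy as np

def classify_patch(x, A_i, coef, K):
    """Class of one probe patch: keep only the k-th entry of the gallery
    coefficient (the delta_k map, one gallery image per class) and pick the
    class whose gallery patch best reconstructs x."""
    residuals = []
    for k in range(K):
        ck = np.zeros_like(coef)
        ck[k] = coef[k]                  # delta_k zeroes all other classes
        residuals.append(np.linalg.norm(x - A_i @ ck))
    return int(np.argmin(residuals))

def classify_image(patch_xs, patch_As, patch_coefs, K):
    """Majority voting over the per-patch decisions."""
    votes = [classify_patch(x, Ai, c, K)
             for x, Ai, c in zip(patch_xs, patch_As, patch_coefs)]
    return int(np.bincount(votes, minlength=K).argmax())

rng = np.random.default_rng(3)
K, d = 4, 50
A_i = rng.standard_normal((d, K))        # toy patch gallery dictionary
x = A_i[:, 2]                            # probe patch identical to class 2's gallery patch
coef = np.ones(K)                        # toy coefficient vector
print(classify_patch(x, A_i, coef, K))   # -> 2
```

Since x reconstructs class 2's gallery patch exactly, its residual for k = 2 is zero and the patch (and, by voting, the image) is assigned to class 2.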

D. SOLUTION OF THE PROPOSED OPTIMIZATION PROBLEM
The optimization problem (9) is a minimization problem of a convex quadratic function, so it has a closed-form solution. By substituting the constraints into the objective function, we convert (9) to an unconstrained form (11). Then we solve the minimum of (11). We take the derivatives with respect to Z, S, H^i, and S^i, respectively, and set the derivatives to zero. We get the following four equations:

A^T (X − AZ − DS) + Σ_{i=1}^{C} (A^i)^T (X^i − A^i(Z + H^i) − D^i S^i) = λZ,  (12)
D^T (X − AZ − DS) = βS,  (13)
(A^i)^T (X^i − A^i(Z + H^i) − D^i S^i) = γH^i,  (14)
(D^i)^T (X^i − A^i(Z + H^i) − D^i S^i) = βS^i,  (15)

From (13) and (15) we have S = (D^T D + βI)^{−1} D^T (X − AZ) and S^i = ((D^i)^T D^i + βI)^{−1} (D^i)^T (X^i − A^i(Z + H^i)), where I is an identity matrix. We get Z and H^i by solving the above equations:

Z = (A^T P A + γ Σ_{i=1}^{C} Q^i (A^i)^T P^i A^i + λI)^{−1} (A^T P X + γ Σ_{i=1}^{C} Q^i (A^i)^T P^i X^i),  (16)
H^i = Q^i (A^i)^T P^i (X^i − A^i Z),  (17)

where P = I − D(D^T D + βI)^{−1} D^T, P^i = I − D^i((D^i)^T D^i + βI)^{−1} (D^i)^T, Q^i = ((A^i)^T P^i A^i + γI)^{−1}, and C is the number of the patches.
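The elimination of S and S^i followed by the closed-form solve for Z and H^i can be checked numerically. The sketch below implements that derivation on random toy data; the helper `solve_ugr`, the projection shortcut `proj`, and all sizes are our own assumptions, so this is a consistency check rather than the authors' implementation.

```python
import numpy as np

def solve_ugr(X, A, D, Xp, Ap, Dp, lam=3.0, beta=0.01, gamma=0.01):
    """Closed-form minimiser of the uniform objective: returns Z and the
    per-patch offsets H^i after eliminating S and S^i."""
    def proj(Dm):
        # P = I - D (D^T D + beta I)^-1 D^T removes the variation part
        n = Dm.shape[0]
        return np.eye(n) - Dm @ np.linalg.solve(
            Dm.T @ Dm + beta * np.eye(Dm.shape[1]), Dm.T)

    K = A.shape[1]
    P = proj(D)
    lhs = A.T @ P @ A + lam * np.eye(K)
    rhs = A.T @ P @ X
    cache = []
    for Ai, Di, Xi in zip(Ap, Dp, Xp):
        Pi = proj(Di)
        Qi = np.linalg.inv(Ai.T @ Pi @ Ai + gamma * np.eye(K))
        lhs = lhs + gamma * Qi @ (Ai.T @ Pi @ Ai)
        rhs = rhs + gamma * Qi @ (Ai.T @ Pi @ Xi)
        cache.append((Qi, Pi, Ai, Xi))
    Z = np.linalg.solve(lhs, rhs)
    H = [Qi @ (Ai.T @ Pi @ (Xi - Ai @ Z)) for Qi, Pi, Ai, Xi in cache]
    return Z, H

rng = np.random.default_rng(4)
d, dp, K, V, N, C = 60, 25, 5, 8, 3, 4          # toy sizes
A = rng.standard_normal((d, K))
D = rng.standard_normal((d, V))
X = rng.standard_normal((d, N))
Ap = [rng.standard_normal((dp, K)) for _ in range(C)]
Dp = [rng.standard_normal((dp, V)) for _ in range(C)]
Xp = [rng.standard_normal((dp, N)) for _ in range(C)]
Z, H = solve_ugr(X, A, D, Xp, Ap, Dp)
```

Plugging the returned Z and H^i back into the stationarity condition for Z verifies that the gradient vanishes, confirming the closed form.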

IV. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we evaluate the proposed algorithm on the Extended Yale B [42], AR [43], CMU-PIE [44], and LFW [45] databases. For the global generic representation, we resize the face images to 30 × 30. For the local generic representation, we resize the face images to 80 × 80 and fix the patch size to 20 × 20, and the interval κ between the centers of two adjacent patches is 10 pixels. We set λ = 3, β = 0.01, and γ = 0.01 in all experiments.
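Given these settings, the number of local regions per image follows from simple arithmetic; the sliding-window formula below is an assumption consistent with 20 × 20 patches at interval κ = 10 on 80 × 80 images.

```python
# Number of overlapped patches per image (assumed sliding-window layout).
size, patch, kappa = 80, 20, 10
per_axis = (size - patch) // kappa + 1   # patches along one axis
C = per_axis ** 2                        # total patches per image
print(per_axis, C)  # -> 7 49
```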

A. EXTENDED YALE B DATABASE
The Extended Yale B face database contains 2432 images of 38 individuals taken under 64 lighting conditions. Its extreme lighting conditions make it a challenging task for most face recognition methods. We show some sample images in Fig. 2.
The first 30 subjects are chosen for evaluation, while the other 8 subjects are used as the generic dataset. We use the frontal faces under the normal illumination condition as the gallery dataset and the face images under the other illumination conditions as the probe dataset. We list the recognition rates of all the methods, together with the computation time for recognizing one probe image, in Table 1. One can see that the recognition rate of our method reaches 91.5%, ranking first. Although FLDA-single and LRA*-GL take the least time, their recognition rates are not satisfactory. The topic model DpLSA achieves the third-highest recognition rate by extracting bag-of-words features. Patch-based methods like PSRC, PCRC, and PNN achieve higher recognition rates than generic learning-based methods like AGL, LRA*-GL, and SVDL, which means that image partition helps a lot under extreme lighting conditions. By taking advantage of both image partition and the generic training dataset, LGR and JCR-ACF perform relatively well, but they are inferior to our method: compared with LGR and JCR-ACF, the recognition rate of our method is higher by 5.7% and 1.3%, respectively. This is because our method benefits from the global consistency in addition to image partition and the generic training dataset. Moreover, our method runs much faster than LGR and JCR-ACF.

B. AR DATABASE
The AR face database contains over 4,000 color face images of 126 individuals, including frontal faces with different lighting conditions, facial expressions, and occlusions. Each individual has 26 pictures taken in two separate sessions. Fig. 3 shows some sample images from AR database.
Following the experimental setup in [14], we carry out experiments with a subset of 100 subjects. The first 80 subjects from session 1 and session 2 are used for evaluation, and the other 20 subjects from session 1 are used as the generic dataset. The images with neutral expression and under normal illumination condition from session 1 are used as the gallery samples. Table 2 and Table 3 show respectively experimental results on two sessions.
It can be seen that DpLSA achieves the highest recognition rates on session 1, reaching nearly 100%. On session 2, DpLSA also takes first place. The average recognition rate of our method is lower than that of DpLSA, but DpLSA uses bag-of-words features, while our method only uses raw pixel features. Even so, our method achieves a higher recognition rate in the illumination case, which is consistent with the experimental results on the Extended Yale B database. Although the generic dataset is not used, the performance of PSRC and PCRC is better than that of ESRC and SVDL on both sessions. This is because patch-based methods can deal with occlusion well. By exploiting the variation information and local features, LGR, JCR-ACF, and the proposed UGR outperform PSRC and PCRC on both sessions. UGR also utilizes global features, so it works best among all competing methods that use raw pixel features.

C. CMU-PIE DATABASE
The CMU pose, illumination, and expression (CMU-PIE) database contains more than 40,000 facial images of 68 individuals. For each individual, the images are taken across 13 different poses, under 43 different illumination conditions, and with four different expressions. We show some sample images in Fig. 4.
We evaluate our method using the first 48 subjects and build the generic dataset using the remaining 20 subjects. The neutral face images taken with the frontal pose (C27) are used as the gallery samples, and the remaining images with the poses C05, C07, C09, C27, and C29 are used as the probe samples. The results of different methods are listed in Table 4, from which we can see that the proposed UGR performs best among all competing methods. For the poses C05 and C29, our method achieves the best results, but the recognition rates are only about 70%. The reason may be that these poses (looking left and right) cause a misalignment issue, and the reflection of light increases the difficulty of recognition. For the poses C07, C09, and C29, UGR achieves significantly better performance than the other methods, which demonstrates that our method is more robust to pose change.

D. LFW DATABASE
The LFW database contains over 13,000 face images of 5,749 individuals taken in an unconstrained environment. LFW-a is a version of LFW aligned with a commercial software tool [46]. 158 subjects with no fewer than 10 samples are gathered from LFW-a, and we collect 10 face images for each subject. Some sample images are shown in Fig. 5.
The first 50 subjects are chosen for evaluation, while the rest are used as the generic dataset. Because no frontal neutral face images exist in this database, we select one image in turn as the gallery sample and use the remaining nine images as probe samples; thus, ten experiments are performed, and we report the average recognition rate. For the generic learning-based methods, we use each person's mean face as the neutral image in the generic dataset.
The performance of different methods is presented in Table 5, from which one can see that the recognition rates of all the methods are rather low. This is because raw gray features cannot represent human faces well in an unconstrained environment. Still, the proposed UGR outperforms all the other methods. In [40], JCR-ACF achieves exciting performance on the LFW database by taking advantage of adaptive deep convolutional features. In this article, we also evaluate our method with the convolutional features of the LFW database provided by Yang et al. [40]. Because the convolutional features are extracted in patches, we concatenate the patch features of one face as its global features. The experimental results are listed in Table 6. It can be seen that the results in Table 6 show a dramatic improvement over those in Table 5, which indicates that the convolutional features have good distinctiveness.
We further compare our method using the convolutional features with four deep learning-based methods, i.e., DeepID [47], JCR-ACF [40], VGG-Face [48], and SGL+LightenedCNN [19]. Since their implementation requires tremendous auxiliary data and careful parameter fine-tuning, we cite the existing results of VGG-Face and SGL+LightenedCNN from [19]. The authors of [19] state that they follow the protocol in JCR-ACF [40]. The results of DeepID and JCR-ACF are reported in accordance with [40]. We follow the same protocol and use the convolutional features provided by [40]. The performance is shown in Table 7. It can be seen that our method outperforms all the other methods except SGL+LightenedCNN. SGL achieves the highest accuracy by employing LightenedCNN features. However, SGL is solved by alternating optimization, while our method has a closed-form solution and is easy to implement.

E. EVALUATION OF THE PROPOSED REPRESENTATION 1) PARAMETER EVALUATION
In this section, we analyze the influence of λ, β, and γ on the performance of our method. Experiments are carried out on the Extended Yale B and AR (session 1 + session 2) databases. We tune β and γ within the range {0.005, 0.01, 0.05, 0.1, 0.5} and λ within the range {1, 2, 3, 4}. The performance of UGR under different parameter combinations is shown in Fig. 6. We observe that the performance tends to deteriorate when β and γ are too large. On the Extended Yale B database, the performance of UGR is best when β = 0.01 and γ = 0.005 or 0.01. On the AR database, the performance of UGR is stable with respect to changes of λ and β when γ is fixed, and UGR works best when γ = 0.005 or 0.01. For the sake of simplicity, we fix λ = 3, β = 0.01, and γ = 0.01.

2) MAGNITUDE OF Z WITH RESPECT TO H^i
In this section, we measure the magnitude of Z with respect to H^i. We use the following equation to calculate the magnitude:

r = (1/C) Σ_{i=1}^{C} mean(|Z| ⊘ |H^i|),  (18)

where C is the number of patches, mean(·) calculates the mean value of all elements of a matrix, and ⊘ denotes element-wise division. Equation (18) reflects the magnitude relationship between the elements of Z and H^i. We carry out experiments on the AR database. The results are shown in Table 8. We can see that in all cases, the value of Z is more than five times that of H^i. This means that Z + H^i is close to Z.
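The magnitude ratio described above amounts to a few lines of numpy; the sketch below follows one plausible reading of "the division is performed by element" (element-wise |Z| over |H^i|, averaged over elements and then over patches), with `magnitude_ratio` as an illustrative name.

```python
import numpy as np

def magnitude_ratio(Z, H_list):
    """Average over patches of the mean element-wise ratio |Z| / |H^i|."""
    return float(np.mean([np.mean(np.abs(Z) / np.abs(Hi)) for Hi in H_list]))

# sanity check on constant matrices: each element of Z is 5x each of H^i
r = magnitude_ratio(np.ones((3, 3)), [0.2 * np.ones((3, 3))])
print(r)  # -> 5.0
```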

3) NUMBER OF NEAREST NEIGHBOR PATCHES
As mentioned in the text, adding neighbor patches can improve the robustness of the representation and increase the classification power. In this section, we will verify this and analyze the impact of the number of extracted nearest neighbor patches on the performance of our method.
A center patch has eight nearest neighbor patches, as shown in Fig. 7. We carry out experiments with different numbers of extracted nearest neighbor patches on AR (session 1 + session 2) database. Six different settings are listed in Table 9. For S1, we add all eight nearest neighbor patches to the classification experiment. For S2, we use six nearest neighbor patches of two combinations. For S3, S4, and S5, we use 2, 4, and 6 combinations, respectively. And for S6, we do not use the neighbor patches. For each setting, we carry out experiments with the listed combinations and report the average recognition rate in Table 10. We can see that the recognition rate decreases as the number of neighbor patches decreases. The recognition rate is only 82.6% without using neighbor patches, which is 12.8% lower than that of using eight neighbor patches. The results show the importance of using neighbor patches.

4) SCALE EVALUATION
In this section, we discuss the impact of scales on recognition performance. The performance with different combinations of global size and patch size is shown in Fig. 8. One can see that when the patch size is fixed to 20 × 20, the performance is the highest regardless of the global size. On the Extended Yale B database, the performance with the global size of 20 × 20 and the patch size of 20 × 20 is higher than the others. On the AR database, the performance with the global size of 30 × 30 and the patch size of 20 × 20 is higher than the others. The reason for this difference may be that the Extended Yale B database contains extreme lighting conditions, which contaminate the obtained discriminating information, and images of different global sizes contain different amounts of contaminated information. For the Extended Yale B database, the global size of 20 × 20 may therefore be more suitable. To keep the parameters of all experiments consistent, we fix the global size to 30 × 30 and the patch size to 20 × 20.

5) NUMBER OF LOCAL REGIONS
In this section, we evaluate the impact of the number of local regions on recognition performance. The global size and the patch size are fixed, and we obtain different numbers of local regions by changing the interval κ. Table 11 shows the recognition rates of our method with different numbers of local regions. It can be noticed that the recognition rate of our method decreases as the number of local regions is reduced. But referring to the results of other methods in Table 1, Table 2, and Table 3, our method is still very competitive.

V. CONCLUSION
In this article, we integrate the global and local generic representations into a uniform framework to address the SSPP problem. The consistency between the global and local generic representations is explored. In the proposed method, the whole image and its patches are employed simultaneously; therefore, the available information for our method is more than that for the global or local generic representation alone. Experiments on the Extended Yale B, AR, CMU-PIE, and LFW databases show that our method works well for the SSPP problem. Especially when dealing with illumination variation, our method using raw pixel features is superior to DpLSA using bag-of-words features. The proposed representation is also scalable. In future work, we can combine the global representation with multiple local representations based on different patch sizes, continuing the idea of consistency explored in this work. In this work, we put forward a quadratic optimization problem corresponding to the proposed representation; when multiple representations are combined, the optimization problem will become more complicated. Therefore, how to formulate the optimization problem is worth further study, and it will be discussed in the future.