
Knowledge-Based Systems

Volume 181, 1 October 2019, 104790

Robust joint representation with triple local feature for face recognition with single sample per person

https://doi.org/10.1016/j.knosys.2019.05.033

Highlights

  • RJR-TLF is proposed for FR with SSPP, making innovations in both feature extraction and classifier design.

  • Triple local feature exploits the robustness and discrimination of local scale, orientation and space of the face image.

  • Robust joint representation jointly combines local features with their weights adaptively distributed to further enhance the robustness.

  • The proposed RJR-TLF is evaluated extensively on popular databases with promising results.

Abstract

Face recognition (FR) with a single training sample per person (SSPP) is a representative small-sample-size classification problem and occurs in many practical scenarios such as law enforcement, surveillance, identity card, e-passport, etc. By using intra-class variations extracted from an additional training set (i.e., a set excluding gallery subjects) as a generic intra-class variation dictionary, sparse representation based classification (SRC) has been extended to FR with SSPP. However, for FR with SSPP, how to achieve high robustness to gross facial variations (e.g., complex facial lighting, expression and pose variations and various outliers of corruption, occlusion and disguise) is still an open issue. In this paper, we propose a novel model, named robust joint representation with triple local feature (RJR-TLF), to address this issue from the viewpoints of feature extraction and classifier design. In feature extraction, we design robust triple local features, i.e., Gabor facial features with multiple scales and multiple orientations extracted in different facial local regions (e.g., local patches centered around dense regularly sampled points and detected particular points including nose tip, eye centers, etc.), to naturally encode the local scale, local orientation and local space information of a face image. For face images, the densely and regularly sampled facial regions can provide a comprehensive description; the sparsely and particularly detected facial regions can exploit a discriminative description because they cover the most informative facial regions and can be detected robustly. In classifier design, we propose a robust joint representation framework to exploit the distinctiveness and similarity of different local information by requiring triple local features from the same type of Gabor feature (i.e., with the same scale and orientation) to have similar representation coefficients. 
With the coefficient-similarity constraint and the robust representation fidelity term representing the query image on the single-sample gallery set and the generic intra-class variation dictionary, local features with large representation residuals actually indicate corrupted regions with gross facial variations and are adaptively assigned low weights to reduce their effects on the representation and classification, which further strengthens the robustness of RJR-TLF. The proposed RJR-TLF is evaluated extensively on popular databases, including the AR, the large-scale CMU Multi-PIE, and the LFW databases. Experimental results demonstrate that RJR-TLF is much more robust to various facial variations than recent methods for FR with SSPP.

Introduction

Face recognition (FR) is an important area in computer vision and pattern recognition and has been drawing considerable attention from the research community [1] for several decades. Nevertheless, many challenges remain in FR under uncontrolled or less controlled environments [2], [3], [4], [5], including varying ages, various appearance variations of the query facial images and the limited number of training samples, hindering the progress of FR technology. As one of the most challenging problems, FR with a single training sample per person (SSPP) occurs in many practical scenarios such as law enforcement, surveillance, identity cards, e-passports, etc. The deficiency of training samples for each subject makes it hard to predict the various variations of the query samples. For FR with SSPP, how to robustly handle complex facial variations, outliers and corruption remains a challenging and open problem.

The pattern classification community has witnessed great prosperity since many classification algorithms as well as public software have been presented: for example, [6] revisited many well-established algorithms and examined their classification performance and time efficiency, and [7] released open-source software specially for multi-class imbalance learning. However, small-sample-size learning problems are still important issues attracting much attention from various fields, such as multiple-task transfer learning [8], novel category learning from a few annotated examples [9] and distance metric learning for person re-identification [10]. For FR with SSPP, the performance of FR deteriorates [11] because no within-class variations are provided for each subject. First, many conventional discriminative subspace algorithms (e.g., LDA and its variants [12]) cannot be directly applied since the within-class information cannot be estimated. Second, sparse representation models, e.g., sparse representation based classification (SRC) [13], will suffer because they need sufficient training samples for each person to adequately reconstruct a query sample. Accordingly, many methods specific to FR with SSPP have been proposed. Depending on whether or not an additional generic training set, which is collected separately and excludes the gallery subjects, is necessary, these specially designed methods can be classified into two categories: methods without using a generic training set and methods requiring a generic training set.

The methods without using the generic training set for FR with SSPP include three common aspects: extracting robust local features (e.g., gradient orientation [14] and local binary pattern [15]), generating additional virtual training samples (e.g., via singular value decomposition [16], lower–upper decomposition [17] and geometric transform and photometric changes [18]) and performing image partitioning (e.g., local patch based LDA [19], self-organizing maps of local patches [20], local structure based multi-phase collaborative representation [21] and manifold learning from local patches [22], [23], [24]). Although these methods improve FR with SSPP to some degree, they cannot overcome the small-sample-size problem in FR with SSPP because they fail to introduce additional intra-class facial variation information into the single-sample gallery set. Additionally, the new information introduced by the virtual training sample generation is very limited due to the high correlation between virtual training samples and the original single gallery sample.

FR with SSPP methods requiring a generic training set can borrow very useful information (e.g., generic intra-class variations of facial images) from an additional training set (i.e., a set excluding gallery subjects). This benefits from the fact that facial variations of different subjects share similarities and that a generic training set including a large number of generic subjects with multiple facial images is usually easy to collect. Therefore, the generic training set is widely used in FR with SSPP to extract discriminative information [25], [26], [27], [28], [29]. For example, [27] and [25] utilized a generic set to learn an expression-invariant subspace and a pose-invariant subspace to reduce the effects of expression and pose variations, respectively. Hu et al. [28] introduced a discriminative transfer learning approach in which discriminant analysis is performed on the multiple-sample generic training set and then transferred to the single-sample gallery set. [29] proposed an equidistant prototypes embedding approach for FR with SSPP, in which a linear regression is learned to map the gallery samples and the intra-class facial differences of the generic faces to equally distant locations and the zero vector, respectively. Although the generic training set benefits these methods by introducing more facial variation information, they are still vulnerable to local and complex pose, expression and occlusion variations.

Another line of work using generic training sets applies methods based on sparse representation. The pioneering work, sparse representation based classification (SRC) proposed by Wright et al. [13], achieved considerable success when enough training samples are available for each gallery subject. Aside from SRC, which uses sample-wise sparse representation, [30] argued that class-wise sparse representation would be better. Due to the requirement of enough training samples for each gallery subject, these methods cannot handle FR with SSPP well. To deal with FR with SSPP, Deng et al. [31] extended SRC and proposed Extended SRC (ESRC), which computes an intra-class variation dictionary from a generic training set in order to represent the intra-class difference between the query images and the single-sample gallery. Following ESRC, many works were proposed for FR with SSPP. Zhu et al. [32] proposed a local generic representation (LGR) based method, which extracts local patches from the facial image and uses the same strategy as ESRC to construct an intra-class variation dictionary for each patch. Ding et al. [33] presented a variational feature representation-based classification approach in which a normal feature expected to preserve the identity information of the query sample is obtained. Dictionary learning for sparse representation has also been studied [34]. In FR with SSPP, a common variation dictionary is learned from a generic training set to improve the representation ability of a single-sample gallery set [35], [36], [37]. For instance, Yang et al. [35] proposed the sparse variation dictionary learning (SVDL) method, which learns a sparse variation dictionary adaptive to the gallery set by jointly learning a projection to connect the generic intra-class variations with the gallery set. Zhuang et al. [36] proposed to learn an illumination variation dictionary to deal with image corruption and misalignment.
[38] learned a robust auxiliary dictionary to handle under-sampled FR where the testing sample may be corrupted by occlusion. Gao et al. [37] introduced a regularized patch-based representation approach (RPR), in which the variation dictionary is learned for each patch.
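The core idea that ESRC [31] and its followers share can be sketched in a few lines of NumPy. The toy implementation below is illustrative only: ridge-regularized least-squares coding stands in for the l1 sparse coding that ESRC actually uses, and the function and variable names are our own assumptions.

```python
import numpy as np

def esrc_classify(y, G, V, labels, lam=1e-3):
    """Sketch of the ESRC idea: code the query y over the single-sample
    gallery G stacked with a generic intra-class variation dictionary V,
    then classify by class-wise residual.  Ridge regression stands in
    for the l1 sparse coding of the original method."""
    D = np.hstack([G, V])                            # combined dictionary [G, V]
    # regularized least-squares coding: x = (D^T D + lam I)^{-1} D^T y
    x = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ y)
    a, b = x[:G.shape[1]], x[G.shape[1]:]            # gallery / variation coefficients
    classes = np.unique(labels)
    # residual per class: keep only that class's gallery coefficients,
    # but always keep the shared variation part V @ b
    residuals = [np.linalg.norm(y - G @ np.where(labels == c, a, 0.0) - V @ b)
                 for c in classes]
    return classes[int(np.argmin(residuals))]
```

Note how the variation term `V @ b` is retained in every class residual: it absorbs the intra-class difference (lighting, expression, etc.) so that the class decision rests on the identity part of the representation.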

Although the above sparse representation and dictionary learning based methods specially designed for FR with SSPP have made progress, several issues remain. First, ESRC [31], SVDL [35] and Zhuang's method [36] all use global features, which are easily affected by gross facial variations and outliers. Second, although LGR [32] and RPR [37] partition the face image regularly into local patches to boost robustness, they fail to exploit the discrimination of particular facial regions such as the eyes and nose, which cover the most informative parts of the face and have proved to have high detection rates under various facial variations [39]. In addition, LGR introduces a robust representation fidelity term to further deal with corrupted facial regions but represents each local patch independently, which cannot effectively exploit the joint representation of all local regions. Conversely, although the similarity of different local patches is regularized in RPR, the representation fidelity term robust to facial variations is ignored. Furthermore, all the above methods adopt intensity features without considering other features, such as multi-scale and multi-orientation Gabor features [40], [41], [42] and LBP features [43], [44], [45], which have been successfully applied to FR to exploit the scale and orientation information of the face image. The Curvelet transform [46] was likewise designed to represent images at different scales and orientations and has been used for face recognition [47], [48], [49], [50]. Besides hand-crafted features, in recent years deep learning based methods [51], [52] have learned deep features from large-scale labeled face images and achieved strong FR performance. However, for FR with SSPP, a single training sample per class is insufficient to train deep convolutional neural networks. To overcome this limitation and the domain gap between the gallery set and the probe set, [53] generated synthetic images using a 3D model and proposed a method that combines face synthesis and a deep architecture with domain-adversarial learning. However, the high discrimination of the features learned in [53] actually owes much to the pre-trained VGG-Face [51], which is further fine-tuned.

Given the above issues of facial description and image classification in previous sparse representation based methods for FR with SSPP, in this paper we aim to fully exploit multiple types of local information in the face image, i.e., the local scale, local orientation and local space information, and to design a robust and powerful classifier. Compared with the facial features used in [32] and [37], multiple types of local information are more discriminative than a single type, and this can be achieved by a well-designed feature. Considering that some local features can be affected by the corrupted parts of the face image and that the joint representation of local features from the same image is more discriminative, we propose a classifier tailored to reduce the importance of problematic local features and to exploit the similarities of local features in their representation. As a result, we propose a robust joint representation with triple local feature (RJR-TLF) approach that considers both feature extraction and classifier design. In triple local feature extraction, we first densely sample regular facial regions and sparsely select particular facial regions (such as the nose and eyes). The particular local regions cover the most informative parts of the face, while the regular local regions represent the face completely, since the particular facial regions can only cover limited parts of the face. Then, by extracting multi-scale and multi-orientation Gabor features from each facial image, we obtain the Gabor features of all local regions as the triple local features, which naturally encode the information of local scale, local orientation and local space. For the robust joint representation, a robust representation fidelity term is introduced to reduce the effects of gross facial variations (e.g., complex expression and pose variations and various outliers of corruption, occlusion and disguise).
Meanwhile, different triple local features of the query image extracted from the same type of Gabor features (i.e., sharing the same scale and orientation) are required to have similar representation coefficients in the robust representation framework. With this coefficient-similarity constraint, the local features with large representation residuals actually indicate corrupted regions with gross facial variations and will be given lower weights to reduce their impact on the representation and classification processes. As a result, the robustness of the proposed RJR-TLF model to gross variations can be further enhanced. Here we briefly summarize our contributions. Unlike the previous Gabor features used in [40], [41], [42], where all or partial Gabor features with different scales and orientations are concatenated into one feature vector, our method views different triple local features (with different scales, orientations and spaces) independently in the feature extraction stage, but leaves their “integration” in the classification stage. Compared with previous sparse representation based methods for FR with SSPP (such as ESRC [31], SVDL [35], LGR [32] and RPR [37]) that use global or local intensity features, we design new robust triple features based on Gabor features. Besides designing new features, we also design a new sparse representation based model to further enhance the robustness to gross facial variations by introducing a robust representation fidelity term and jointly representing the triple local features from the same type of Gabor feature.
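As a rough illustration of the triple-local-feature idea, the sketch below builds a simplified real-valued Gabor kernel per (scale, orientation) pair and collects patch responses around a given set of points, which in the paper would mix densely sampled grid points with detected landmarks such as the eye centers and nose tip. The kernel parameters, the FFT-based convolution helper and all function names are illustrative assumptions, not the paper's exact filter bank.

```python
import numpy as np

def conv2_same(img, ker):
    """'same'-size 2-D convolution via FFT (NumPy only)."""
    H, W = img.shape
    kh, kw = ker.shape
    sh, sw = H + kh - 1, W + kw - 1                   # full convolution size
    full = np.fft.irfft2(np.fft.rfft2(img, (sh, sw)) *
                         np.fft.rfft2(ker, (sh, sw)), (sh, sw))
    r0, c0 = (kh - 1) // 2, (kw - 1) // 2             # crop back to the input size
    return full[r0:r0 + H, c0:c0 + W]

def gabor_kernel(scale, theta, size=15, sigma=4.0):
    """A simplified real-valued Gabor kernel at one scale and orientation."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    xr = xs * np.cos(theta) + ys * np.sin(theta)      # rotate coordinates by theta
    yr = -xs * np.sin(theta) + ys * np.cos(theta)
    freq = np.pi / (2.0 * 2 ** scale)                 # frequency halves as scale grows
    return np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) * np.cos(freq * xr)

def triple_local_features(img, points, patch=8, scales=(0, 1), thetas=(0.0, np.pi / 2)):
    """One feature vector per (scale, orientation, location) triple:
    the Gabor response restricted to a local patch around each point."""
    feats = {}
    for s in scales:
        for t in thetas:
            resp = conv2_same(img, gabor_kernel(s, t))
            for (r, c) in points:
                feats[(s, t, (r, c))] = resp[r - patch:r + patch,
                                             c - patch:c + patch].ravel()
    return feats
```

Crucially, and in line with the design above, each triple stays a separate feature vector here; nothing is concatenated, so the classifier is free to weight each (scale, orientation, location) independently.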

Extensive experiments have been conducted on facial databases with various variations, including illumination, expression, pose, session and occlusion. In particular, the proposed method is compared with powerful local feature methods and discriminative sparse representation models, such as the multi-scale and multi-orientation local LBP features [43], the Enhanced Local Texture Feature Sets (ELTFS) method [45], RPR [37], SVDL [35] and LGR [32]. The experimental results show that the proposed RJR-TLF is much more robust than previous methods for FR with SSPP.

The rest of this paper is organized as follows. Section 2 presents a brief review of the related works. Section 3 gives the proposed RJR-TLF method. Section 4 describes the optimization algorithm of RJR-TLF. Section 5 conducts the experiments and Section 6 concludes the paper.

Section snippets

Brief review of the related works

In this section, we will review the closely related methods, ESRC [31] and LGR [32], in detail.

Robust joint representation with triple local feature

In this section, we propose a robust joint representation with triple local feature (RJR-TLF) model for FR with SSPP. Inspired by the validity of local patches in handling facial variations (e.g., occlusion, small pose, etc.) and the powerful discrimination of Gabor features, we densely sample regular local patches and sparsely detect particular local patches (e.g., eyes, etc.) from the Gabor features of the face image, where the robust and discriminative triple local features, including local scale, local

Solving the algorithm of RJR-TLF

The original model of RJR-TLF changes to Eq. (11) after the Taylor expansion [59], which is naturally an alternating optimization problem. We solve the optimization problem in Eq. (11) by an iterative procedure, which alternately calculates $\alpha_j^k$ and updates $\omega_j^k$. The iterative procedure is described as follows.

When the weight $\omega_j^k$ is fixed, the coding coefficient $\alpha_j^k$ can be derived as $\alpha_j^k = \alpha_j^{k,0} + \mu P_j^k \bar{\alpha}_j$, where $\alpha_j^{k,0} = P_j^k \omega_j^k (G_j^k D_j^k)^T y_j^k$ and $P_j^k = \left(\omega_j^k (G_j^k D_j^k)^T G_j^k D_j^k + (\lambda + \mu) I\right)^{-1}$. Based on $K\bar{\alpha}_j = \sum_{k=1}^K \alpha_j^k$, by summing $\alpha_j$
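The alternating procedure (solve the patch codes with the weights fixed, then reweight patches by their representation residuals) can be sketched as follows. This is a simplified stand-in under stated assumptions: the closed-form code update mirrors the ridge-style solution, a plain exponential weight function replaces the paper's Taylor-expansion-derived weights, and the Gabor transform matrices are folded into the per-patch dictionaries.

```python
import numpy as np

def rjr_solve(y_patches, D_patches, lam=0.01, mu=0.1, n_iter=5):
    """Alternating scheme: with the patch weights fixed, each code has a
    closed-form ridge-style solution pulled toward the shared mean code;
    with the codes fixed, each patch weight shrinks as its representation
    residual grows, downweighting corrupted patches.  The exponential
    weight function is a simple stand-in, not the paper's exact rule."""
    K, n = len(y_patches), D_patches[0].shape[1]
    alphas = [np.zeros(n) for _ in range(K)]
    w = np.ones(K)
    for _ in range(n_iter):
        a_bar = np.mean(alphas, axis=0)                          # shared mean code
        for k, (y, D) in enumerate(zip(y_patches, D_patches)):
            # alpha_k = (w_k D^T D + (lam + mu) I)^{-1} (w_k D^T y + mu a_bar)
            A = w[k] * D.T @ D + (lam + mu) * np.eye(n)
            alphas[k] = np.linalg.solve(A, w[k] * D.T @ y + mu * a_bar)
        res = np.array([np.linalg.norm(y - D @ a)
                        for y, D, a in zip(y_patches, D_patches, alphas)])
        w = np.exp(-res ** 2 / (np.mean(res) ** 2 + 1e-8))       # small residual -> weight near 1
    return alphas, w
```

On toy data, a patch whose observation lies far outside its dictionary's span ends the iterations with a weight close to zero, so it barely influences the shared mean code or the final classification, which is exactly the robustness mechanism described above.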

Experiments

In this section, we perform face recognition (FR) with single sample per person (SSPP) on benchmark facial databases, including the AR database [55], the large-scale CMU Multi-PIE database [56] and the Labeled Faces in the Wild (LFW) database [57], to demonstrate the performance of RJR-TLF. We first discuss the parameter setting in Section 5.1; in Section 5.2 we investigate the discrimination of the particular facial regions; in Sections 5.3, 5.4, and 5.5, we evaluate the performance of RJR-TLF

Conclusion

In order for FR with SSPP to achieve high robustness against gross facial variations, we proposed a robust joint representation with triple local feature (RJR-TLF) approach based on two considerations: feature extraction and classifier design. In feature extraction, the triple local features from the particular and regular local facial regions are extracted, the local information (local scale, orientation and space) of the facial image and the discriminative parts (e.g., eyes and nose) of the

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Grant nos. 61772568, 61672357, 61602504 and 61703283), the Guangzhou Science and Technology Program, China (Grant no. 201804010288), the Fundamental Research Funds for the Central Universities, China (Grant no. 18lgzd15), the Guangdong Natural Science Foundation, China (Project 2017A030310067), and the Shenzhen Scientific Research and Development Funding Program (Grant no. JCYJ20170302153827712).

References (63)

  • Wang, J., et al., On solving the face recognition problem with one training sample per subject, Pattern Recognit. (2006)
  • Deng, W., et al., Equidistant prototypes embedding for single sample based face recognition with generic learning and incremental learning, Pattern Recognit. (2014)
  • Ding, R., et al., Variational feature representation-based classification for face recognition with single sample per person, J. Vis. Commun. Image Represent. (2015)
  • Mandal, T., et al., Curvelet based face recognition via dimension reduction, Signal Process. (2009)
  • Mohammed, A.A., et al., Human face recognition based on multidimensional PCA and extreme learning machine, Pattern Recognit. (2011)
  • Elaiwat, S., et al., A curvelet-based approach for textured 3D face recognition, Pattern Recognit. (2015)
  • Gross, R., et al., Multi-PIE, Image Vis. Comput. (2010)
  • Zhao, W., et al., Face recognition: A literature survey, ACM Comput. Surv. (2003)
  • Li, Z., et al., Aging face recognition: A hierarchical learning model based on local patterns selection, IEEE Trans. Image Process. (2016)
  • Wolf, L., et al., Effective face recognition by combining multiple descriptors and learned background statistics, IEEE PAMI (2011)
  • Weng, R., et al., Robust point set matching for partial face recognition, IEEE Trans. Image Process. (2016)
  • Saha, B., et al., Multiple task transfer learning with small sample sizes, Knowl. Inf. Syst. (2016)
  • Wang, Y.X., et al., Learning to learn: Model regression networks for easy small sample learning
  • Zhang, L., et al., Learning a discriminative null space for person re-identification
  • Belhumeur, P.N., et al., Eigenfaces vs. fisherfaces: Recognition using class specific linear projection, IEEE PAMI (1997)
  • Wright, J., et al., Robust face recognition via sparse representation, IEEE PAMI (2009)
  • Tzimiropoulos, G., et al., Subspace learning from image gradient orientations, IEEE PAMI (2012)
  • Ahonen, T., et al., Face recognition with local binary patterns
  • Shan, S., et al., Extended fisherface for face recognition from a single example image per person
  • Tan, X., et al., Recognizing partially occluded, expression variant faces from single training image per person with SOM and soft k-NN ensemble, IEEE Trans. Neural Netw. (2005)
  • Lu, J.W., et al., Discriminative multimanifold analysis for face recognition from a single training sample per person, IEEE PAMI (2013)

    No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.05.033.
