Bi-directional CRC algorithm using CNN-based features for face classification

Abstract: Collaborative representation-based classification (CRC) has become a breakthrough in face classification due to its distinguished collaborative capacity. Nevertheless, only a few or even a single gallery image per subject is usually available for face classification, which leads to a sensitive response to variations in the original data set. In this study, the authors present a bi-directional CRC algorithm using convolutional neural network-based features for face classification. They first employ a deep convolutional neural network to extract facial features from the original gallery and query sets, and then develop a fast reverse representation model to obtain the auxiliary residual information between each training sample and the reconstructed one that is achieved from the test sample. Secondly, they offer a new solution to the bi-directional optimisation problem, by which the input sample is well represented by the forward linear combination and the reverse one, respectively. The last contribution is a competitive fusion method for robust face recognition, which weights the reconstructed residuals from the bi-directional representation model. Experimental results obtained from a set of well-known face databases including AR, FERET, and ORL verify the validity of the proposed method, especially its robustness to the small-sample-size problem.


Introduction
It is well known that the breakthrough of sparse representation-based classification (SRC) theory [1][2][3][4] facilitated the study of collaborative-representation-based classification (CRC) [5] in signal reconstruction. However, only very few or even a single face image is usually available to perform face recognition in some practical applications. Although SRC and CRC utilise the sparsity properties of the dataset, they cannot deal well with variations such as pose, illumination, expression, and occlusion changes. To solve this problem, many previous works have been presented to construct optimised forms of representation, by synthesising virtual training images or enhancing the solution performance of the optimisation problem. For example, new face images of different poses are constructed by an example-based method and the recognition rate is improved in [6,7]. Meanwhile, face symmetry is very important for quickly locating candidate samples, and has achieved success in applications of face detection and classification [8][9][10][11][12][13][14]. Recently, Xu et al. integrated the original image and the mirror image for representation-based face classification [15]. Another approach that generates 'symmetrical' synthesised face images is also exploited for face recognition by Xu et al. [16].
Conventional and inverse representation-based linear regression classification (CIRLRC) [17] applied the reverse representation in such an algorithm for the first time. However, it exploits a crude complementary representation-based classification (RBC) method; the large number of constructed systems of linear equations may lead to over-fitting because of the redundancy of the auxiliary residuals between each training sample and the reconstructed one. It also greatly increases the computational cost of traditional RBC. In fact, few attempts have been made to avoid the uncertainty problem in the reverse representation by evaluating the relationship between the forward representation and the reverse model. The uncertainty of the optimisation problem usually derives from inaccurate information, or a lack of knowledge about the key representation procedure; for instance, the traditional CIRLRC model, built on the raw pixel space, may suffer from extremely complex and unreliable data sources. Moreover, it is generally known that convolutional neural networks (CNNs) [18,19] have drawn extensive attention due to their successful applications in a wide range of pattern recognition and computer vision tasks. In particular, robust textural features can be extracted by a CNN trained on a large number of face images, across a variety of variations such as illumination, pose, occlusion etc. This motivates us to exploit a new bi-directional CRC algorithm using CNN-based features (BCRC-CNN) for robust face classification.
In this study, we develop a BCRC-CNN algorithm to expand a fast hybrid RBC model in the CNN feature space, using an efficient bi-directional representation strategy to evaluate the residual between each input sample and the reconstructed one. The contributions of our work are 4-fold:
• We employ a deep CNN, projecting the process of classification from the raw pixel space to the CNN feature space. To this end, we use the pre-trained VGGNet-16 model [20] for feature extraction to alleviate adverse information from the original gallery and query sets, improving recognition accuracy.
• We develop a fast reverse representation model to obtain the auxiliary residual information between each training sample and the reconstructed one that is achieved from the test sample, which provides an efficient and robust reverse representation mechanism.
• We propose a new solution to the bi-directional optimisation problem, by which the input sample is well represented by the forward representation and the reverse one, respectively. The method achieves the residuals from the bi-directional RBC model at minimum cost.
• We utilise a competitive fusion method to perform face classification within the BCRC framework by weighting the reconstructed residuals from the bi-directional representation model. Finally, the label of the test sample is estimated from the scores of the different subjects.
The rest of the paper is organised as follows: Section 2 introduces the VGGNet-16 model and outlines CIRLRC. Section 3 presents the proposed method in detail. Section 4 evaluates the effectiveness of BCRC-CNN by experimenting on several common face databases, and Section 5 concludes the paper.

Presentation of the VGGNet-16 model
The VGGNet [20] was originally developed by the Visual Geometry Group. In this study, we adopt the pre-trained VGGNet-16 model for feature extraction. The architecture of the VGGNet-16 network is shown in Fig. 1.
It has five convolution segments, each containing two or three convolution layers followed by a max-pooling layer that reduces the spatial size of the feature maps. Dropout is also used after the first and second fully-connected layers to avoid over-fitting.

Outline of CIRLRC
In this section, we will show the outline of the CIRLRC method. We suppose that there are K classes and each class has M training samples. Let x k (m) stand for the mth training sample of the kth subject.
In CIRLRC, the mirror test sample $y_v \in \Re^{a \times b}$ for the original test sample $y \in \Re^{a \times b}$ is produced by

$$y_v(p, q) = y(p, b - q + 1),$$

where p = 1, …, a, q = 1, …, b, and a, b stand for the numbers of rows and columns of the face image matrix, respectively; $y_v(p, q)$ and $y(p, q)$ denote the pixels located in the pth row and qth column of $y_v$ and $y$, respectively. Hence, with the virtual training samples, the dictionary of the ith class becomes $X_i = [x_i(1), \dots, x_i(M), x_i^v(1), \dots, x_i^v(M)]$, and a test sample y is approximated as a linear combination of the training samples from the ith class:

$$y \approx X_i \alpha_i,$$

where $\alpha_i \in \Re^{2M}$ is the coefficient vector of the ith class. Thus, $\alpha_i$ can be estimated using the least-square estimation

$$\hat{\alpha}_i = (X_i^T X_i + \lambda I)^{-1} X_i^T y,$$

where λ is a small positive constant. Then the score between the test sample and the ith class can be obtained by

$$s_i = \| y - X_i \hat{\alpha}_i \|_2.$$

For the jth class, let $Z_j = [X_1, \dots, X_{j-1}, X_{j+1}, \dots, X_K, y, y_v]$. The linear representation of each native training sample $x_j(m)$, m = 1, …, M, from the jth class can be written as

$$x_j(m) \approx Z_j \beta_j(m),$$

where $\beta_j(m)$ is calculated using

$$\hat{\beta}_j(m) = (Z_j^T Z_j + \lambda I)^{-1} Z_j^T x_j(m),$$

and the deviation between the test sample and the mth training sample of the jth class is obtained by

$$e_j(m) = \| x_j(m) - g_j(m)\, y - h_j(m)\, y_v \|_2,$$

where $g_j(m)$ and $h_j(m)$ are the entries of $\hat{\beta}_j(m)$ corresponding to y and $y_v$, respectively. The distance between the test sample and the jth class can be solved using

$$d_j = \min_{1 \le m \le M} e_j(m).$$

Then the scores and distances of the test sample y with respect to all classes are normalised by

$$s_j' = \frac{s_j}{\sum_{k=1}^{K} s_k}, \qquad d_j' = \frac{d_j}{\sum_{k=1}^{K} d_k},$$

which are used to calculate the ultimate score with respect to the jth class:

$$t_j = w_1 s_j' + w_2 d_j',$$

where $w_1$ and $w_2$ are the weights. Let $w_2$ be smaller than $w_1$ and $w_1 + w_2 = 1$. Hence, the label of the test sample is estimated using the index of the smallest value over all $t_j$: $\mathrm{Label}(y) = \arg\min_j t_j$.
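As a quick illustration of the forward (conventional) half of CIRLRC, the mirror-image construction and the ridge least-squares class score can be sketched in a few lines of NumPy; the image sizes and data below are dummy placeholders, not the databases used in the experiments:

```python
import numpy as np

def mirror(img):
    # y_v(p, q) = y(p, b - q + 1): a horizontal flip of the face matrix
    return img[:, ::-1]

def class_score(train_imgs, test_img, lam=0.01):
    """Forward CIRLRC score for one class: residual of the ridge
    least-squares fit of the test sample on the class's original
    plus mirrored training samples."""
    cols = ([im.ravel() for im in train_imgs]
            + [mirror(im).ravel() for im in train_imgs])
    X = np.stack(cols, axis=1)                    # d x 2M class dictionary
    y = test_img.ravel()
    alpha = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return np.linalg.norm(y - X @ alpha)          # s_i = ||y - X_i alpha_i||

rng = np.random.default_rng(0)
gallery = [rng.random((8, 6)) for _ in range(3)]  # 3 tiny "training" images
s_same = class_score(gallery, gallery[0])         # probe is a gallery image
s_diff = class_score(gallery, rng.random((8, 6))) # unrelated probe
```

A probe lying in the span of a class's (mirrored) samples yields a much smaller residual than an unrelated one, which is the basis of the forward score.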

Proposed method
In this section, we will introduce BCRC-CNN in detail. Fig. 2 shows the schematic diagram of our proposed method and the main steps are as follows.
We suppose that there are K classes and each class has M training samples. Let y stand for the test sample and x k (m) stand for the mth training sample of the kth subject. Let X = [x 1 (1), …, x K (M)] denote all the K × M training samples.
The first step is to use the pre-trained VGGNet-16 model [20] for feature extraction. A CNN trained on a large number of face images is able to extract robust textural features for face recognition, across a variety of appearance variations such as pose, expression, illumination, and occlusion. Thus, the original test sample and training samples are projected into the feature space:

$$y' = f(y), \qquad X' = f(X),$$

where f(·) denotes the VGGNet-16 feature mapping.

The second step is to develop a fast bi-directional representation model, in which the input sample is well represented by both the forward linear combination and the reverse one. The objective is

$$\min_{A,\, Q} \; \| y' - X'A \|_2^2 + \alpha \| y'Q - X' \|_2^2 + \beta \| QA \|_2^2 + \mu \| A \|_2^2 + \lambda \| Q \|_2^2, \tag{14}$$

where A is the optimal coefficient solution to the forward representation and Q is the optimal coefficient solution to the reverse representation. The last three terms are sparse constraint items; α, β, μ, and λ balance the effect of each term in (14). It can be proved that (14) is convex in each of A and Q and differentiable, so the optimal solution to (14) is a stationary point of the objective function.

The third step is to propose a new solution to the above bi-directional optimisation model. The partial derivatives of the objective function with respect to A and Q are computed as follows:

$$\frac{\partial}{\partial A}\left( \| y' - X'A \|_2^2 + \alpha \| y'Q - X' \|_2^2 + \beta \| QA \|_2^2 + \mu \| A \|_2^2 + \lambda \| Q \|_2^2 \right) = -2X'^T y' + 2X'^T X'A + 2\beta Q^T Q A + 2\mu A,$$

$$\frac{\partial}{\partial Q}\left( \| y' - X'A \|_2^2 + \alpha \| y'Q - X' \|_2^2 + \beta \| QA \|_2^2 + \mu \| A \|_2^2 + \lambda \| Q \|_2^2 \right) = 2\alpha y'^T y' Q - 2\alpha y'^T X' + 2\beta A^T A Q + 2\lambda Q.$$
Let f equal the objective function. The stationary point of f can be calculated from $(\partial f / \partial A) = 0$ and $(\partial f / \partial Q) = 0$, which means $(X'^T X' + \beta Q^T Q + \mu I) A = X'^T y'$ and $(\alpha y'^T y' + \beta A^T A + \lambda I) Q = \alpha y'^T X'$. Finally, the optimal solutions to (14) are

$$A = (X'^T X' + \beta Q^T Q + \mu I)^{-1} X'^T y', \tag{17}$$

$$Q = (\alpha y'^T y' + \beta A^T A + \lambda I)^{-1} \alpha y'^T X'. \tag{18}$$

The coefficient vectors A and Q are then computed iteratively by (17) and (18) until a predefined termination condition is satisfied.

The last step is to weight the reconstructed residuals from the bi-directional representation model for the ultimate classification. Let $s_i = \| y' - X_i' A_i \|$ and $d_i = \| X_i' - y' Q_i \|$ (i = 1, …, K) denote the forward score and the reverse distance between y and the ith subject, respectively. They are normalised by

$$s_i' = \frac{s_i}{\sum_{k=1}^{K} s_k}, \qquad d_i' = \frac{d_i}{\sum_{k=1}^{K} d_k},$$

and combined by $t_i = w_1 s_i' + w_2 d_i'$ to calculate the ultimate score, where $w_1$ and $w_2$ are the weights and $w_1 + w_2 = 1$. As the forward representation is more reliable than the inverse representation in evaluating dissimilarity, the proposed algorithm assigns a larger value to $w_1$ than to $w_2$. Finally, the label of y is estimated using

$$\mathrm{Label}(y) = \arg\min_i t_i.$$
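The alternating updates (17) and (18) and the weighted fusion can be sketched as follows. This is a minimal NumPy illustration with synthetic CNN-like features; the dimensions, data, iteration count, and weight values are illustrative assumptions, not the authors' settings:

```python
import numpy as np

def bcrc_scores(Xp, yp, alpha=0.5, beta=0.1, mu=0.01, lam=0.01, iters=30):
    """For one class dictionary Xp (d x M) and feature yp (d,), alternate the
    closed-form updates (17)-(18) and return the forward residual s and the
    reverse residual d of the bi-directional model."""
    d_dim, m = Xp.shape
    y = yp.reshape(-1, 1)
    A = np.zeros((m, 1))              # forward coefficients: y' ~ X'A
    Q = np.zeros((1, m))              # reverse coefficients: X' ~ y'Q
    for _ in range(iters):
        A = np.linalg.solve(Xp.T @ Xp + beta * (Q.T @ Q) + mu * np.eye(m),
                            Xp.T @ y)                              # Eq. (17)
        Q = (alpha * y.T @ Xp) / (alpha * float(y.T @ y)
                                  + beta * float(A.T @ A) + lam)   # Eq. (18)
    s = np.linalg.norm(y - Xp @ A)    # forward score
    d = np.linalg.norm(Xp - y @ Q)    # reverse distance
    return s, d

def classify(class_dicts, yp, w1=0.7):
    """Normalise forward/reverse residuals over classes and fuse them."""
    w2 = 1.0 - w1
    sd = np.array([bcrc_scores(X, yp) for X in class_dicts])  # K x 2
    s, d = sd[:, 0] / sd[:, 0].sum(), sd[:, 1] / sd[:, 1].sum()
    return int(np.argmin(w1 * s + w2 * d))    # label with smallest fused score

rng = np.random.default_rng(1)
centres = [rng.normal(size=16) for _ in range(3)]          # 3 class prototypes
dicts = [np.stack([c + 0.05 * rng.normal(size=16) for _ in range(4)], axis=1)
         for c in centres]                                  # 4 samples per class
probe = centres[2] + 0.05 * rng.normal(size=16)             # belongs to class 2
label = classify(dicts, probe)
print(label)
```

Each update is a small closed-form solve (the Q update is even scalar-normalised since y'ᵀy' is a scalar), which is why the bi-directional model avoids the large systems of equations that make CIRLRC slow.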

Experimental results
In this section, we will evaluate the effectiveness of the BCRC-CNN algorithm and compare it with some state-of-the-art algorithms by experimenting on publicly available datasets, including AR [21], FERET [22] and ORL [23]. The face images in all three databases vary in illumination, expression and pose. In all the experiments, we compare results with w 1 set to 0.6, 0.7 and 0.8. The parameters α, β, μ and λ were set to 0.5, 0.1, 0.01 and 0.01, respectively, for fair comparisons. (The coefficient α of the reverse representation was set to 0.5 because it has less influence than the forward representation, whose implicit coefficient is 1. β was set to 0.1 to balance the sparsity between the forward and inverse representations. μ and λ were small positive constants set to 0.01 so as to have only little influence on the classification performance.) Moreover, we repeated each experiment 20 times and took the average result. The AR database [21] contains about 1680 images of 120 individuals, including faces with different facial expressions, illuminations, and disguises. Each image was resized to 50 × 40. Fig. 3 shows some images from AR.
The FERET database [22] is a benchmark database for evaluating face classification algorithms. We select a subset of FERET for experimenting that includes 200 individuals, each of whom provided seven images. Each image was down-sampled to 80 × 80. Fig. 4 shows some normalised images from the FERET database.
The ORL database [23] is composed of 400 images from 40 different persons. Images of some individuals were taken at different times, with varying lighting, facial expressions, and facial details. We resize each image to 46 × 56. Fig. 5 shows some normalised images from ORL.

Experiment on the AR
For AR, the first one, two, and three images per class are, respectively, selected for training. Table 1 shows the average recognition rates (%) of linear regression classification (LRC) [24], CRC [5], SRC [4], nearest neighbour (NN) [25], CIRLRC [17] and BCRC-CNN under different weights. In most cases, BCRC-CNN classifies more accurately than the other methods. For instance, when the number of training samples of each subject is three, the classification accuracies of LRC, CRC, SRC, CIRLRC and the proposed method with w 1 = 0.6, are 68.71, 78.94, 80.30, 76.82, and 88.94%, respectively.
To better evaluate our method under the effect of appearance variations in the AR database, we divide the samples from the same subject into three situations: (i) the first and eighth images are selected as the natural faces taken in two periods; (ii) the second-fourth as well as ninth-11th samples are selected for expression changes; (iii) the fifth-seventh and 12th-14th samples are selected for illumination changes. In each round of the experiment, we select the corresponding images per class as testing samples and the rest are training samples. Fig. 6 shows the classification results of CIRLRC and BCRC-CNN with w 1 = 0.6 over the variations on the AR database. As we can see from Fig. 6a, the proposed method is robust to expression and illumination changes, and the recognition rates of BCRC-CNN under the different situations are all above 90%. Furthermore, the strong result of BCRC-CNN for the illumination situation indicates that the proposed bi-directional representation learning model can improve classification performance. The running time (s) of both methods is also given in Fig. 6b. It is worth emphasising that CIRLRC took far more time to complete the process. For example, under illumination changes, CIRLRC consumed 18,989.16 s, while the proposed method needed only 733.09 s, which is 96.13% faster than the previous algorithm. This is because the proposed method develops an efficient bi-directional representation model using both the forward representation and the reverse model based on CNN features, instead of separately calculating the score and the distance as CIRLRC does. As analysed above, these figures strongly prove the effectiveness of BCRC-CNN.

Experiment on the FERET
For FERET, θ (θ = 2, 3, 4) images of each class are randomly selected as training samples and the remaining ones are used for testing, with respect to 200 different persons. Table 2 compares BCRC-CNN with LRC [24], CRC [5], SRC [4], NN [25], two-phase test sample representation (TPTSR) [26] and CIRLRC [17] under different numbers of training samples. The best recognition rates of BCRC-CNN are 93.60, 93.75, and 93.83% with θ (θ = 2, 3, 4) training samples. From the experimental results, we can observe that our method achieved higher classification rates while consuming less running time.

Experiment on the ORL
For ORL, θ (θ = 1, 2, …, 9) images of each class are randomly selected as training samples and the remaining ones are used for testing, with respect to 40 different persons. Fig. 7 illustrates the comparison of BCRC-CNN with LRC [24], CRC [5], SRC [4], NN [25], and CIRLRC [17] under different numbers of training samples. The proposed algorithm performs best among all methods and the recognition rates of BCRC-CNN are nearly 100% in most cases. Furthermore, it is worth stressing that the curve of BCRC-CNN remains nearly flat as the number of training samples increases, implying the effectiveness and robustness of BCRC-CNN.

Conclusion
In this study, we proposed BCRC-CNN for face classification. The critical innovation is an efficient hybrid RBC model in the CNN feature space, which is used to accomplish face recognition. Meanwhile, the residuals from the forward representation and the reverse model are integrated at minimum cost for robust classification. The strength of the novel scheme lies in successfully utilising the auxiliary residual information from the test sample and enhancing the recognition rate across a variety of appearance variations. The experimental results adequately demonstrate the effectiveness of BCRC-CNN. We believe that this kind of novel RBC algorithm will further improve representation capabilities in image recognition in the future.