Deep Residual Feature Quantization for 3D Face Recognition

3D face recognition (FR) has been successfully addressed using convolutional neural networks (CNNs), which have demonstrated impressive results in diverse computer vision and image classification tasks. Training CNNs, however, requires estimating millions of parameters, which demands high-performance computing capacity and storage. To deal with this issue, we propose an efficient method based on the quantization of residual features extracted from a pre-trained ResNet-50 model. The method starts by describing each 3D face using a convolutional feature extraction block, and then applies the Bag-of-Features (BoF) paradigm to learn a deep network (we call it Deep BoF). To do so, we apply Radial Basis Function (RBF) neurons to quantize the deep features extracted from the last convolutional layer. An SVM classifier is then applied to classify faces according to their quantized term vectors. The obtained model is lightweight compared to a classical CNN, and it allows classifying arbitrary-sized images. Experimental results on the FRGCv2 and Bosphorus datasets show the strength of our method compared to state-of-the-art methods.


I. INTRODUCTION
Face recognition in unconstrained scenarios has advanced impressively with the rise of CNNs, compared to previous methods based on hand-crafted feature extractors such as Local Binary Patterns and SIFT [1]. Deep learning has proven more robust for classifying faces in the presence of variations such as expression, rotation, and scale. Its outstanding performance has been shown on challenging classification tasks over large labeled image databases such as ImageNet [2]. To reduce the number of parameters in the network, a CNN can be combined with other models such as the Bag-of-Features model [3]. The remainder of the paper is structured as follows: Section II gives an overview of related works, and Section III describes the method. Experimental findings on the FRGCv2 and Bosphorus datasets are presented in Section IV. Conclusions end the paper.

II. RELATED WORKS
Various methods have been suggested in the literature to reduce the model size of CNNs, with the aim of improving CNN-based image classification methods. We may divide them into three groups:
a) Global pooling strategies: these make it possible to deal with arbitrary input sizes. For example, He et al. [4] applied Spatial Pyramid Pooling, which produces a fixed-length representation and provides more robustness to object deformations. Malinowski and Fritz [5] suggested a parameterization of the pooling operator and analyzed the impact of different regularizers on the pooling regions; approximations to the proposed model are then applied to allow efficient training.
b) Compression and pruning techniques: these focus on compressing an already trained CNN. They are also unable to manage images of varying sizes, so a fixed input size is required for feed-forwarding through the CNN layers. As an example, Chen et al. [6] proposed frequency-sensitive hashed nets that exploit the redundancy in the CNN layers, achieving high savings in memory and storage consumption. Han et al. [7] applied a pipeline of pruning, quantization, and Huffman encoding: (1) the network is pruned by learning only the most relevant connections; (2) the weights are quantized to enforce weight sharing, so that Huffman encoding can then be applied. Further, to enable CNN deployment on systems with insufficient computing power, Wu et al. [8] proposed a quantized CNN framework to jointly accelerate computation and reduce storage requirements.
c) Feature vector quantization techniques: the classical BoF paradigm extracts a representative set of local patches from an image, computes a geometrical or visual descriptor vector for each patch, and uses the resulting distribution of descriptors as a characterization of the image. The BoF paradigm has been widely employed to quantize shallow features [9].
When dealing with trainable convolutional layers, BoF can be viewed as a pooling layer and makes it possible to classify images of different sizes [10].

III. METHOD
Fig. 1 depicts a summary of our FR proposal. The first step is preprocessing: we apply different filters to remove facial deficiencies and noise (a smoothing filter and a median filter), and we discard the unnecessary parts of the body via a cropping filter. Next, each 3D face model is transformed into a 2D depth image, which encodes depth as the gray value of each pixel. All 2D faces are normalized to 224 × 224 pixels, the input size of the CNN model. After this stage, we extract deep features from the last convolutional layer of the pre-trained ResNet-50 network, as follows:
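The preprocessing pipeline above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the filter sizes and the fixed `crop_margin` cropping rule are assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy import ndimage

def preprocess_face(depth_map, out_size=224, crop_margin=0.1):
    """Denoise, crop, and normalize a raw depth map into a 2D depth image.

    `crop_margin` is a hypothetical stand-in for the paper's cropping
    filter, which removes the unnecessary parts of the body.
    """
    # Median filter removes spike noise; a small smoothing filter
    # suppresses remaining acquisition artifacts.
    face = ndimage.median_filter(depth_map, size=3)
    face = ndimage.uniform_filter(face, size=3)

    # Crop a fixed border as a simple stand-in for face cropping.
    h, w = face.shape
    mh, mw = int(h * crop_margin), int(w * crop_margin)
    face = face[mh:h - mh, mw:w - mw]

    # Map depth values to gray levels in [0, 1], then resize to the
    # 224 x 224 input resolution expected by the CNN.
    span = face.max() - face.min()
    face = (face - face.min()) / (span + 1e-8)
    zoom = (out_size / face.shape[0], out_size / face.shape[1])
    return ndimage.zoom(face, zoom, order=1)
```

The output is a single-channel depth image; to feed a standard ResNet input it can simply be replicated across three channels.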

A. Residual feature extraction layer
We extract residual deep features using the ResNet-50 model [11], a variant of ResNet pre-trained on the ImageNet dataset. It has 48 convolutional layers along with one max-pooling and one average-pooling layer, and requires 3.8 × 10^9 floating-point operations. Fig. 2 shows the architecture of this model in detail. We only employ the residual features shown in the same figure; these features are utilized in the next layer's computation.

B. Deep BoF layer
Using the residual layer described above, we extract feature maps (FMs) from the i-th image. We use the RBF kernel as a similarity measure to estimate the similarity between these features and the codewords, which form the term vector, as proposed in [12]. As a result, the first sublayer is made up of RBF neurons, each of which is assigned to a codeword. The number of feature vectors acquired from the i-th image is denoted by p_i. The initial setting of the RBF neurons can be done manually or automatically while creating the codebook; k-means is the most widely used automated algorithm. Let P denote the set of all feature vectors, P = {p_ij, i = 1 … p, j = 1 … p_i}, where p is the number of images, and let n_k denote the number of RBF neuron centers (codewords). It is important to note that these RBF centers are learned afterward in order to obtain the final codewords. Quantization is then used to retrieve a histogram with a predetermined number of bins, each of which refers to a codeword. The RBF layer, which has two sublayers, is thus utilized as a similarity measure.
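The k-means codebook initialization mentioned above can be sketched with a plain Lloyd's-iteration implementation; this is a minimal illustration, and in practice any k-means library routine would serve the same role.

```python
import numpy as np

def init_codebook(features, n_codewords, n_iter=20, seed=0):
    """Initialize the RBF centers (codewords) by running k-means on the
    pooled set P of residual feature vectors (one row per vector)."""
    rng = np.random.default_rng(seed)
    # Start from n_codewords randomly chosen feature vectors.
    centers = features[rng.choice(len(features), n_codewords, replace=False)]
    for _ in range(n_iter):
        # Assign every feature vector to its nearest center.
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned vectors.
        for k in range(n_codewords):
            members = features[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers
```

These initial centers are only a starting point; as stated above, the RBF centers are further learned during training to obtain the final codewords.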
(1) RBF layer: determines how close the probe face's input features are to the RBF centers. The output of the j-th RBF neuron is given by:

φ_j(x) = exp(−‖x − n_j‖² / σ_j),

where x denotes the input feature vector, n_j the center of the j-th RBF neuron, and σ_j the width of that neuron.
(2) Quantization layer: this layer combines the outputs of the RBF neurons into a global histogram that is then employed in the classification stage. It is given by:

h_i = (1/p_i) Σ_{j=1…p_i} φ(p_ij),

where φ(p_ij) is the n_k-dimensional vector yielded by the RBF layer for the j-th feature vector of the i-th image, so that h_i has one bin per codeword.
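The two sublayers can be sketched in NumPy as follows. This is a minimal illustration under two assumptions stated in the comments: a single shared RBF width `sigma`, and l1-normalization of each RBF response so that every feature vector contributes one unit of mass to the histogram.

```python
import numpy as np

def rbf_layer(fmap_vectors, centers, sigma=1.0):
    """RBF sublayer: phi_j(x) = exp(-||x - n_j||^2 / sigma) for every
    feature vector x (rows of fmap_vectors) and every center n_j.
    A single shared sigma is an assumption of this sketch; each row is
    l1-normalized so one feature vector contributes one unit of mass."""
    d2 = ((fmap_vectors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / sigma)
    return phi / (phi.sum(axis=1, keepdims=True) + 1e-12)

def quantization_layer(phi):
    """Quantization sublayer: average the RBF outputs over all p_i
    feature vectors of the face, giving a global histogram with one
    bin per codeword (n_k bins)."""
    return phi.mean(axis=0)
```

For a 7 × 7 × 2048 residual feature map, `fmap_vectors` would be the 49 vectors of length 2048, and the histogram length equals the codebook size.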

C. Dense layer and classification
The classification of the faces is carried out using an SVM classifier, with each face represented by a term vector. The backpropagation method is applied to train the Deep BoF network using gradient descent. Ten-fold cross-validation is carried out in our proposal on the FRGCv2 and Bosphorus datasets. We denote the term vector of each face by P = [p_1, …, p_k], where each p_i denotes the frequency of the i-th codeword in the given face. The term vector P is used to identify test faces.
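The classification stage can be sketched with scikit-learn. The term vectors below are synthetic stand-ins generated for illustration only (the real ones come from the Deep BoF layer); the linear kernel is likewise an assumption, since the paper does not state which SVM kernel is used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the quantized term vectors: one 60-bin
# histogram per face scan, one label per subject.
rng = np.random.default_rng(0)
n_subjects, scans_per_subject, n_bins = 5, 20, 60
X = np.vstack([rng.normal(loc=i, scale=0.5, size=(scans_per_subject, n_bins))
               for i in range(n_subjects)])
y = np.repeat(np.arange(n_subjects), scans_per_subject)

# 10-fold cross-validation, as in the paper's protocol.
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())
```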

IV. EXPERIMENTAL RESULTS
The proposed approach was evaluated using the FRGCv2 [13] and Bosphorus [14] datasets. We used MATLAB R2017b on Windows 7 with an i7 CPU.

A. Datasets description
FRGCv2 dataset: a very challenging dataset for the FR task. It includes 4,007 3D face scans of 466 different subjects. The scans were acquired in unconstrained scenarios, and the subjects vary in gender and age.
Bosphorus dataset: consists of 4,666 scans of 105 subjects. In addition to expression variations, pose variations and occlusions are present in this dataset, which makes it more challenging than FRGCv2.

B. Results and discussion
1) Setting: The FRGCv2 and Bosphorus faces were preprocessed as presented in Section III. Using the normalized 2D depth faces at three sizes (i.e., 200 × 200, 224 × 224, and 250 × 250 pixels), we use the pre-trained ResNet-50 model to extract residual features as presented in Section III-A. In this layer the FMs have dimension 7 × 7 with 2048 channels. The global histogram is then extracted using the quantization-based method described in Section III-B. Finally, an SVM classifier is employed to attribute each face to its possible identity, as presented in Section III-C. To validate the model, 10-fold cross-validation is used, with mini-batches of 50 and 50 iterations. Note that all steps of the proposed method are fully automatic, and that we followed the same protocol as [15] in order to make a fair comparison.
2) Results on the FRGCv2 dataset: Table I shows the classification rates on the FRGCv2 dataset. Three codebook sizes are tested: fifty, sixty, and seventy term vectors per image. Using the third FMs in the fifth block layer with sixty term vectors, we obtain the highest recognition rate of 98.9%.
3) Results on the Bosphorus dataset: The classification rates using the three codebooks on the Bosphorus dataset are shown in Table II. The highest performance achieved is 97.3%, using the third FMs. Tables III and IV compare the recognition rate of the proposed method with state-of-the-art accuracy on the FRGCv2 and Bosphorus datasets, respectively. Our approach outperforms hand-crafted and classical CNN-based methods. This accuracy is due to the residual deep features extracted from the ResNet-50 model and their effective quantization.

V. CONCLUSION
A new methodology using a deep Bag-of-Features paradigm for 3D face recognition has been proposed. Between the last convolutional block and the dense layer, a BoF-based pooling layer is used instead of the traditional CNN design that feeds the extracted vectors directly into the fully connected layer. We thus extract a global quantization histogram from the residual features of the ResNet-50 model. The obtained representation is independent of the image size and considerably reduces the number of parameters found in a classical CNN, while achieving high recognition accuracy. Note that our method is generic, so other 2D map images can be used (e.g., shape index and curvature maps). As a limitation, our method relies only on residual features, which may not provide high generalization power in the presence of many variations; hence, other pre-trained models could be added to the framework (e.g., AlexNet, ExpNet).