Transfer Learning Technique with VGG-16 for Near-Infrared Facial Expression Recognition

In this paper, we investigate a deep VGG-16 network architecture for facial expression recognition (FER) under an active near-infrared illumination condition. In particular, we apply transfer learning, whereby features learned from the high-resolution images of a very large dataset are used to train a model on a relatively small dataset without losing generalization ability. The pre-trained VGG-16 network was trained and validated on the Oulu-CASIA NIR dataset, which comprises six (6) distinct facial expressions, and an average test accuracy of 98.11% was achieved. Validation on our test data using the confusion, precision, and recall matrices reveals that our method achieves better results than other methods in the literature.


Introduction
Systems for identifying human facial expressions have developed enormously since their invention, and numerous techniques and methods are now widely used. For this reason, further research on facial expression recognition (FER) is fully warranted to provide continual improvement in identification accuracy. Facial expressions are one of the most powerful and natural means for human beings to convey their intentions and emotions [2]. Emotion recognition is an extensive, complex, and useful research topic in many fields such as biomedical engineering, health, education, neuroscience, and psychology [3]. The ability of computers to differentiate and understand human expressions, and to take appropriate action, is a key research area, particularly in human-computer interaction (HCI) [4]. For example, during online study a student's acceptance of material will rise if the computer knows the student's expressive state: with that knowledge, the computer can provide learning material suited to the student's condition at that particular time [4]. Another useful application is lie detection during crime interrogation through an expression recognition system [6]. Emotion detection in biomedical engineering focuses on predicting human expressions and on computer-assisted diagnosis of psychological disorders [3]. Several approaches have been used in the recent past to detect a person's emotional state, such as electroencephalography (EEG), speech analysis, and galvanic skin response (GSR) [3]. Applications can also be extended to settings involving patients with autism, the very elderly, or newborn babies, who may not be able to express their emotions clearly [4].

Related work
The study of transfer learning is inspired by the fact that human beings can quickly apply knowledge learned in the past to solve new tasks faster, or even to find better solutions [11]. The main motivation and primary goal of transfer learning in machine learning were discussed by Pan and Yang (2010) [12], whose paper describes the need for lifelong machine learning architectures that retain and reuse knowledge learned in the past. A related approach is multi-task learning, as postulated by Caruana [13], which tries to learn several tasks at the same time even when the tasks are not the same. Machines that predict emotion from facial expressions alone, or from vocal sounds, follow the work of Mehrabian and of Ekman et al. [14,15]. Gil Levi and Tal Hassner used mapped binary patterns with CNNs to counter illumination variance and trained an ensemble of VGG-16 networks on the CASIA-WebFace dataset [16]. In summary, CNN architectures give promising results for classification tasks and can be used for FER applications like ours. To reduce computational complexity and improve accuracy, transfer learning can be used, in which knowledge obtained from a network pre-trained on the high-resolution images of a very large dataset is transferred to learn on a new, smaller dataset without losing generalization ability.

Transfer learning with VGG-16
Transfer learning is a machine learning technique in which the knowledge obtained from a previous task is applied to another related task, improving learning on the new task [17]. CNN architectures such as ResNet, VGG, and AlexNet are already trained on the huge ImageNet dataset, which comprises more than one million labelled high-resolution images belonging to one thousand (1,000) categories. The knowledge obtained from that task is then reused to learn a new, different task. Transfer learning is especially useful when the training data is relatively small; it performs well on classification tasks, and the computational cost is significantly reduced because the entire training need not start from scratch. VGG-16 is a sixteen (16) layer network used by the VGG group at the University of Oxford to obtain outstanding results in the 2014 ILSVRC competition. Its main feature was the increased depth of the network, and it is known for outstanding performance in several classification tasks such as image and object classification [18]. It was trained on the ImageNet dataset. Table 1 shows the architecture of the VGG-16 network. It uses thirteen (13) convolutional layers and three (3) fully connected (FC) layers (dense layers). The convolutional layers have a kernel size of 3×3 with a stride of 1 and 'same' padding. The pooling layers are all 2×2 with a stride of 2. The input size of the network is 224×224 pixels, and after each pooling layer the size of the feature map is halved. The last feature map before the FC layers is 7×7 with 512 channels; it is flattened into a vector of 25,088 (7×7×512) elements. The last layer of the network is a softmax layer that outputs class probabilities.
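The feature-map arithmetic above can be checked with a few lines (a minimal sketch: the five halvings correspond to VGG-16's five 2×2 max-pooling stages, and the 3×3 'same'-padded convolutions leave the spatial size unchanged):

```python
# VGG-16 spatial-size arithmetic: each of the five 2x2 max-pool
# stages (stride 2) halves the feature map.
size = 224                    # network input: 224x224 pixels
for _ in range(5):            # five pooling stages
    size //= 2
channels = 512                # depth of the last convolutional block
flattened = size * size * channels

print(size)                   # 7 -> last feature map is 7x7x512
print(flattened)              # 25088 -> length of the flattened vector
```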
A static feature representation is used, in which all images are represented at the same size. This is one of the most effective and efficient ways to apply transfer learning to image recognition problems; for transfer learning we used a pre-trained VGG-16 network with an implementation in TensorFlow [19].

Methodology
Our proposed method aims to detect the emotional state of individuals under an active near-infrared illumination condition, based on their facial expressions as manifested in photos. It has been observed that the earlier layers of a DCNN usually capture colour and edge information, whereas the later layers capture attributes more specific to the details of the categories. For this reason, the parameters of the earlier layers need no, or minimal, fine-tuning [20]. Inspired by this, in our proposed work we fine-tuned only the dense (fully connected) layers of the VGG-16 network. The procedure for transforming the VGG-16 layers is as follows. The network was trained on more than a million (1,000,000) high-resolution images and can classify images into a thousand categories [21]; its last three (3) FC layers are designed for those one thousand categories. For an expression recognition task like ours, these layers need to be fine-tuned [22,23]. The fine-tuning procedure is to keep all layers of the network except the last three (3) dense layers, transfer them to the new task, and insert new dense layers, a softmax function, and a classification output layer. In our proposed work, the channel sizes of the FC layers are 128, 256, and 512, and the number of classes is set to six (6). To this end, the remainder of this section describes the training techniques, the image preprocessing steps, and the dataset used.
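To make the head replacement concrete, here is a minimal NumPy sketch of a forward pass through the new dense layers. The layer widths (512, 256, 128) and class count are taken from the text; the descending order, the random weights, and the single-sample shape are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, n_out):
    """Illustrative dense layer with random (untrained) weights."""
    w = rng.standard_normal((x.shape[-1], n_out)) * 0.01
    b = np.zeros(n_out)
    return x @ w + b

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

# One flattened VGG-16 bottleneck feature: 7*7*512 = 25088 values.
x = rng.standard_normal((1, 7 * 7 * 512))
for n in (512, 256, 128):           # the three new FC layers
    x = relu(dense(x, n))
probs = softmax(dense(x, 6))        # six expression classes

print(probs.shape)                  # (1, 6)
```

With trained weights, the arg-max over `probs` would give the predicted expression class.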

Facial expression dataset
A dataset has to be designed or acquired that is sufficiently apt to train the CNN model [19]; it comprises a set of face images with various expressions [19]. One of the extensive datasets found suitable for emotion recognition, and used in this research work, is the Oulu-CASIA NIR dataset [1]. It consists of six (6) basic facial expressions (anger, happiness, sadness, fear, surprise, and disgust) from eighty (80) subjects between 23 and 58 years old, 73.8% of whom are male. Every expression is captured in three different illumination conditions: normal, weak, and dark. Normal illumination indicates good normal lighting, weak illumination means that only the computer display is on, and dark illumination means near darkness [1]. In our proposed work, we chose only the nine thousand five hundred and four (9,504) near-infrared images under normal illumination, as shown in figure 1. The facial expressions are numbered as follows: 1 = anger, 2 = disgust, 3 = fear, 4 = happy, 5 = sad, and 6 = surprise. Together with the class number (between one and six), the final custom dataset contains 9,504 NIR images, and the data was randomly divided into training, validation, and testing sets as shown in tables 2 and 3 respectively. In this experimental work the Oulu-CASIA NIR dataset gave good performance with our proposed method.

Figure 2. Image preprocessing steps: face detection, crop, and resize.
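The class numbering described above can be captured as a simple mapping (illustrative only; the dataset files themselves define the actual encoding):

```python
# Class numbers (1-6) to expression labels, as described in the text.
LABELS = {1: "anger", 2: "disgust", 3: "fear",
          4: "happy", 5: "sad", 6: "surprise"}

def label_name(class_number: int) -> str:
    """Return the expression name for a 1-based class number."""
    return LABELS[class_number]

print(label_name(4))   # happy
print(len(LABELS))     # 6
```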

Face detection, crop and resize
Face detection, sometimes called face registration, is the process of detecting the face portion (the region of interest) in a given image [25]. The OpenCV LBP cascade algorithm [24] was used to detect faces in the images. The face portion is then cropped out to avoid background complexity, so that the model can train efficiently. The cropped images are finally resized to 224×224 pixels, as shown in figure 2.
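As a sketch of the crop-and-resize step: the paper uses OpenCV's LBP cascade for detection, so here the bounding box is assumed already given, and a simple nearest-neighbour resize stands in for OpenCV's resize function:

```python
import numpy as np

def crop_and_resize(image, box, out_size=224):
    """Crop a detected face region (x, y, w, h) and resize it to
    out_size x out_size by nearest-neighbour index selection."""
    x, y, w, h = box
    face = image[y:y + h, x:x + w]
    rows = np.arange(out_size) * face.shape[0] // out_size
    cols = np.arange(out_size) * face.shape[1] // out_size
    return face[rows][:, cols]

# A dummy grayscale "image" with an assumed detection box.
img = np.zeros((480, 640), dtype=np.uint8)
face = crop_and_resize(img, box=(200, 100, 150, 150))
print(face.shape)   # (224, 224)
```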

Training procedures
The images and their respective labels were imported from the Oulu-CASIA NIR dataset, shuffled, and split into training, validation, and testing sets in the ratio 80:10:10. The training and validation data were used during the training process: the training data to learn the network parameters, and the validation data to measure the accuracy of the network after each epoch and guard against over-fitting. The model was tested on the held-out test set immediately after training was completed, and the performance of the network was measured accordingly. The trained model was saved after training so that it can be used for future predictions if the need arises. The architecture is composed of two stages: (1) preprocessing and (2) feature extraction and classification. The proposed architecture uses transfer learning with the VGG-16 network to extract features and feeds its output to a softmax classifier to classify the expressions. First, the region of interest (the face portion) is extracted from the given face images using the LBP frontal-face detector implemented in OpenCV; the cropped face region is then resized to 224×224 pixels, as shown in figures 2 and 3 respectively. Second, we set weights='imagenet' to download the VGG-16 network trained on the huge ImageNet dataset, and set include_top=False to exclude the FC layers of the pre-trained network. The bottleneck features for the training, validation, and testing sets were then created from the VGG-16 ImageNet model. The shape of the bottleneck feature of each image after passing through VGG-16 is 7×7×512, with a batch size of 64.
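The shuffle-and-split step can be sketched as follows (a NumPy illustration of the 80:10:10 partition over the 9,504 images; file loading and the actual label arrays are omitted):

```python
import numpy as np

def split_indices(n, seed=0):
    """Shuffle n sample indices and split them 80:10:10 into
    training, validation, and test index sets."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

train, val, test = split_indices(9504)   # 9,504 NIR images
print(len(train), len(val), len(test))   # 7603 950 951
```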
Feature vectors were extracted from the proposed VGG-16 network with the weights of all five (5) convolutional blocks frozen; after the last VGG-16 convolutional layer, three (3) new FC layers of 512, 256, and 128 nodes were added. The output of the network is given to a new classifier, and the images are classified into one of the six (6) basic facial expressions using the softmax function shown in equation (3). The proposed model used to classify these six (6) basic facial expressions contains the 13 convolutional layers of the VGG-16 architecture, as illustrated in table 1. Convolution over an image f(x, y) with a filter w(x, y) is calculated as shown in equation (1). The activation function used in the hidden layers is the rectified linear unit (ReLU).
Moreover, ReLU is applied to introduce non-linearity into the model [26], because it is known to propagate the error better than the sigmoid function [27]. Equation (2) depicts the ReLU activation function.
The network is fed with images of the default input size of 224×224 pixels. There are three (3) FC layers of 128, 256, and 512 nodes right after the VGG-16 convolutional layers. As in the convolutional layers, the ReLU activation function is applied in these layers, and after each hidden layer we inserted a dropout layer with rate 0.25, which randomly deactivates 25% of the nodes in the hidden layer to avoid over-fitting [28]. Finally, the output layer of the network comprises six (6) nodes, one per class, and its activation function is the softmax classifier shown in equation (3). Adam was used as the model optimizer, and categorical cross-entropy as the loss function. Figure 4 shows the proposed architecture designed for this research work: feature vectors are extracted from the VGG-16 network with the weights of all five (5) convolutional blocks frozen, the output is given to the new classifier, and classification is carried out with the softmax function over the six (6) classes (happiness, sadness, surprise, anger, disgust, and fear).
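The dropout behaviour described above can be sketched as follows (an illustrative training-time "inverted dropout" with rate 0.25; Keras's Dropout layer likewise rescales the surviving activations by 1/(1 - rate) so that expected activations are unchanged at test time):

```python
import numpy as np

def dropout(x, rate=0.25, seed=0):
    """Inverted dropout: zero out roughly `rate` of the activations
    and rescale the survivors by 1/(1 - rate)."""
    rng = np.random.default_rng(seed)
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

acts = np.ones(10000)              # dummy hidden-layer activations
out = dropout(acts)
print((out == 0).mean())           # fraction dropped, close to 0.25
```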

Experimental results and discussions
This section details the classification results on the Oulu-CASIA NIR dataset. The prime objective of this research work is to identify the expression of a human face given an image of that person; owing to its real-world applications, FER has become very popular in recent years. This work proposes a model for the facial expression recognition task that uses transfer learning with the VGG-16 model. To demonstrate the efficiency of our proposed model, it was trained on the Oulu-CASIA NIR dataset, and the results show a good recognition rate across all six (6) classes, as shown in figures 7, 8, and 9 respectively. For performance evaluation we adopt accuracy, the confusion matrix, the precision matrix, and the recall matrix as evaluation indices. A confusion matrix, also known as an error matrix [29], is a table layout that enables computation and visualization of the performance of an algorithm; it is mostly used in machine learning, specifically for statistical classification problems. Each column represents the instances of an actual class while each row represents the instances of a predicted class (or vice versa) [30]. Precision and recall, on the other hand, measure the accuracy of information retrieval, identification, and classification within a computer program; in short, they are measurements of relevance. The proposed model is implemented with the Keras library in Python, using the Adam optimization algorithm and the categorical cross-entropy loss function; the results obtained are given in figures 7, 8, and 9 respectively.
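The evaluation indices above can be computed as in the following self-contained sketch (dummy three-class labels, not the paper's results; here rows are actual classes and columns are predicted classes, though the transposed convention is also common, as the text notes):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = actual class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def precision_recall(cm):
    """Per-class precision (over predicted) and recall (over actual)."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)
    recall = tp / cm.sum(axis=1)
    return precision, recall

# Dummy labels for three classes, purely for illustration.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
prec, rec = precision_recall(cm)
print(cm)
print(prec)   # [0.5        0.66666667 1.        ]
print(rec)    # [0.5 1.  0.5]
```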
Trained on the Oulu-CASIA NIR dataset, our proposed network achieved an average accuracy of 98.11% and sustained a high, nearly equal classification rate for every class, delivering firm classification outputs as shown in figures 10 to 15. The results show that our proposed FER approach, using the pre-trained VGG-16 network with transfer learning, achieves better recognition accuracy than the other existing models. The training and validation progression curves, showing the loss and accuracy values after each epoch, are given in figures 5 and 6; it is evident from these figures that our proposed network reaches 98.11% accuracy within 100 epochs. Figure 7 shows the confusion matrix, figure 8 the precision matrix, and figure 9 the recall matrix for the six (6) classes. Figures 10 to 15 show the classified results from the Oulu-CASIA NIR dataset.

Comparison with other existing methods
We compared our method with other existing facial expression recognition methods on the same dataset under the normal illumination condition. The classification test accuracy achieved by our method compares favourably with the existing methods in the literature; the average test accuracy for each class is listed in table 4. The prime advantage of our proposed model over the other methods is that it needs neither a hand-crafted feature extraction method nor an intermediate feature selection phase.

Conclusion and future work
In this research paper, we propose a transfer learning technique for a facial expression recognition system under an active near-infrared illumination condition. The pre-trained VGG-16 DCNN architecture, trained on the huge ImageNet dataset, is used to initialize the weights, after which the network is trained on the input FER image dataset. This network with the transfer learning technique produced a test accuracy of 98.11%. Several experiments were carried out on the Oulu-CASIA NIR dataset, and the test accuracies show that our method gives better classification results than other methods in the literature. We believe that more robust results can be achieved by investigating other pre-trained DCNN architectures with transfer learning for the FER task, which would allow a comparison of their effectiveness, efficiency, and computational complexity. Our future research will centre on this direction.