Identity preserving multi-pose facial expression recognition using fine tuned VGG on the latent space vector of generative adversarial network

Facial expression is a crucial means by which human beings express their mental state, and its recognition has become one of the prominent areas of research in computer vision. However, the task becomes challenging when the given facial image is non-frontal. The influence of pose on facial images is alleviated using the encoder of a generative adversarial network capable of learning pose-invariant representations. State-of-the-art results for image generation are achieved using the styleGAN architecture. An efficient model is proposed to embed the given image into the latent vector space of styleGAN. The encoder extracts high-level features of the facial image and encodes them into the latent space. A rigorous analysis of the semantics hidden in the latent space of styleGAN is performed. Based on the analysis, the facial image is synthesized, and facial expressions are recognized using an expression recognition neural network. The original image is recovered from the features encoded in the latent space. Semantic editing operations such as face rotation, style transfer, face aging, image morphing and expression transfer can be performed on the image generated from the features encoded in the latent space of styleGAN. An $L_2$ feature-wise loss is applied to ensure the quality of the rebuilt image. The facial image is then fed into the attribute classifier to extract high-level features, and the features are concatenated to perform facial expression classification. Evaluations are performed on the generated results to demonstrate that state-of-the-art results are achieved using the proposed method.

Latent space vectors are poorly explored in existing research. In contrast, this work explores the latent space vector of styleGAN to synthesize the non-frontal facial image while retaining its expression and identity. The latent space concept has many uses in deep learning, from learning the features of the data to simplifying the representation of data in order to find patterns. All the important information necessary to represent the data is hidden in the latent representation. In other words, the model learns features of the data to simplify the representation for easier analysis. Analyzing data in the latent space makes it easier to understand the structural similarities or patterns among data points. As the latent representations hold the data in compressed form and carry only the important information, processing is faster than with classical approaches. Little research has been performed on connecting the latent representations of images to semantic attributes for editing the image. The proposed model interprets the latent space of styleGAN, which is trained for face synthesis. This work is split into two phases. In the first phase, the facial image is synthesized using the latent space vector of a well-trained GAN. The second phase involves modeling a neural network for facial expression recognition. This work (i) explores the latent space of styleGAN and identifies the relationship between the latent representations and semantic attributes of the output; (ii) rigorously analyses the capability of GANs to map the latent vectors to high-resolution images; (iii) presents an efficient model for generating non-frontal facial images preserving the identity and expression of the face; and (iv) provides insight to researchers about how a random distribution is mapped to a high-quality semantic image, how to interpret the semantics of the latent space, and how to use the latent space vectors for various applications.
The remainder of the paper is organized as follows. In Section II, existing works relevant to emotion recognition and GAN are presented. In Section III, the architecture of the proposed method is described. In Section IV, experimental results are discussed with performance analysis. Finally, in Section V, the paper is concluded with a discussion about future work.

Related works
Existing works analyse the generation of high-resolution images from ground truth [27][28][29]; however, very few works analyse the capability of GANs with respect to the latent space. Radford et al. [30] were the first to propose that GANs learn various semantic attributes in the hidden latent space. Mirza et al. [31] proposed a model to generate images using disentangled latent vectors and labeled attributes. This model was extended with a customized loss function and semantic attributes to improve the synthesis quality [32][33][34]. Arvanitidis et al. [35] proposed a model to vary the output smoothly through latent space interpolation. Some works proceeded in the reverse direction, generating the latent space from the image space [36][37]. Wang et al. [38] performed facial expression recognition using an unsupervised domain adaptation method with four datasets, namely FER2013, CK+, MMI, and JAFFE. Seven different emotions (anger, happy, fear, sad, disgust, surprise and neutral) were recognized using the developed model. This expression recognition model was built using AlexNet and VGG11. The facial images from the CK+, MMI, and JAFFE datasets were cropped, while the images from FER2013 were resized to 224 × 224 because the original images, measuring 48 × 48, were too small. Stochastic gradient descent was used during training. Zhang et al. [39] proposed a feature learning model based on a DNN for facial expression recognition. Their method extracted features from facial images using the SIFT method. A feature matrix was built from the extracted SIFT features and passed as input to the DNN model for expression classification. The DNN model explores the relationship between the SIFT features and their semantic features, and thereby learns the features corresponding to facial expressions.
Yang et al. [40] extracted features suitable for classifying facial expressions using a weighted mixture DNN model. Rotation rectification, data augmentation, and face detection are applied to the input data. The model processes both grayscale images and LBP facial images. Features of the grayscale images are extracted using the VGG16 model, and features of the LBP images are extracted using a CNN. The outputs of the two models are combined in a weighted manner, and the classification is performed using softmax. Kim et al. [41] developed a facial expression recognition system using deep hierarchical learning. Their model utilizes two networks, an appearance feature-based network and a geometric feature-based network, to extract holistic features and the coordinates of action units. An autoencoder is designed to generate a facial image with neutral expression. Dynamic facial features are extracted between the emotional and the neutral expression facial image. Zhang et al. [42] used a deep identity network for identifying faces. A deep learning framework based on local facial patches and multi-scale global images was proposed for facial expression recognition. The model localizes the foreground image from the background image. Face part patches are generated with local and global identity information. The generated face patches are fed into a CNN to perform facial expression classification.
Ferreira et al. [43] proposed a DNN architecture with loss functions based on the fact that expressions are associated with the movement of facial muscles. The loss function regularizes the learning process so that the DNN learns features that are specific to an expression. The model identifies the face components, namely nose, eyes, eyebrows and mouth, as well as expression wrinkles, to recognize the facial expression. The model is also capable of learning expression-specific features and facial relevance maps. González-Lozoya et al. [44] improved generalization in facial expression recognition by fusing the instances extracted from different facial databases. Their method is capable of recognizing micro-expressions. Facial expression recognition is performed through face detection in the facial image, feature extraction using a CNN, and modeling. In a nutshell, their model is a prototype system for facial expression recognition and micro-expression recognition for analyzing videos.
Deng et al. [6] proposed a conditional GAN-based approach that individually controls the facial expression and simultaneously learns the generative and discriminative representations. Similarly, Cai et al. [23] proposed a conditional GAN-based approach to reduce the inter-subject variations for expression recognition from facial images. Yang et al. [45] proposed a feature separation model for facial expression recognition tasks. The feature separation is achieved through partial feature exchange and various constraints. Liong et al. [46] proposed a pipeline of four steps: facial landmark annotation, optical flow guided image computation, feature extraction, and emotion class categorization. Here, a GAN is used for data augmentation to generate more image samples. Wu et al. [47] proposed a Cascade Expression Focal GAN to perform progressive facial expression editing with local expression focuses. This approach preserves identity-related features and details around the nose, eyes and mouth.
The current work extracts the pose component, identity component and expressive component from the facial image. The extracted expressive component is used to perform facial expression recognition. This work exploits the latent representation of the facial image to analyze the semantic contents of the image. The model identifies the relationship that exists between the latent vector and the semantics of the image. GAN-based face synthesis is performed by controlling the facial attributes and preserving the identity. A new approach is proposed for performing emotion classification based on a GAN and a fine-tuned VGG19 model. The emotions are classified into seven classes, namely anger, fear, sad, happy, disgust, surprise and neutral. Given any multi-pose facial image, the latent vector is obtained and passed through a generator to generate the facial image. Facial features are extracted from the generated image and from the non-frontal facial image. The difference between them is calculated as the loss, and gradient descent is applied to reduce the loss. The gradients are backpropagated and the image is fine-tuned until an image very close to the input image is obtained. The facial image is then passed through the facial expression recognition neural network to perform emotion classification. The expression recognition neural network is a deep CNN model which extracts high-level facial features and outputs a probability for each of the seven classes. The features extracted from the latent representation are concatenated with the features extracted by the deep CNN model, and facial expression classification is performed.

Materials and methods
When a generative model is trained on a dataset, the model discovers the underlying structure of the data. Once the model has discovered this structure, it can be used in a variety of applications. This work explores the extent to which latent space interpolation can navigate the visual world, for example manipulating an image of a female to look like a male, making a face with neutral expression smile, face aging and more. The basic idea behind an encoder-decoder is that the encoder encodes the pixel space into the latent vector space, and the decoder decodes the information available in the latent space vector to rebuild the actual input. The latent space holds a compressed version of the actual data at a lower dimension than the pixel space. It contains only the information that is required to reconstruct the actual input from the latent space vector. A generator in the GAN architecture exploits the latent space and maps the latent vector to the output [48]. The mapping performed by the generator varies for every epoch. By using random points in the latent vector space, the generator generates a new image. Figure 1 shows the mapping network that maps the latent space vector to another intermediate vector, which is fed to the image synthesis network.
This work exploits the latent space of styleGAN trained on the Flickr Faces High-Quality (FFHQ) dataset. Vector arithmetic operations are performed on points in the latent space to generate images. A random vector from the latent space is passed as input to the generator model. The input samples for the generator are points in the latent space, whose dimensionality is fixed by the size of the latent space. The generator model returns the generated images as output. The number of epochs required for training is arbitrary and can be increased if the quality of the generated images is to be improved.
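The vector arithmetic on latent points described above can be illustrated with a minimal sketch. It assumes latent codes are plain 512-dimensional lists drawn from a standard normal distribution; the helper names (`sample_latent`, `lerp`, `attribute_direction`, `shift`) and the difference-of-means way of obtaining an editing direction are illustrative assumptions, not the procedure used in this work. No generator is called here; in practice each resulting vector would be passed to the styleGAN generator.

```python
import random

LATENT_DIM = 512  # latent size stated for styleGAN in this paper


def sample_latent(rng, dim=LATENT_DIM):
    """Draw one latent point from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def lerp(z_a, z_b, t):
    """Linear interpolation: t=0 returns z_a, t=1 returns z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def mean_vector(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attribute_direction(with_attr, without_attr):
    """Hypothetical editing direction: difference of the two group means."""
    mu_with = mean_vector(with_attr)
    mu_without = mean_vector(without_attr)
    return [a - b for a, b in zip(mu_with, mu_without)]


def shift(z, direction, strength):
    """Move a latent code along a direction by the given strength."""
    return [v + strength * d for v, d in zip(z, direction)]


rng = random.Random(0)
z_a, z_b = sample_latent(rng), sample_latent(rng)

# Five evenly spaced latent codes between z_a and z_b (image morphing).
morph_path = [lerp(z_a, z_b, k / 4) for k in range(5)]

# Stand-ins for latent codes of smiling and neutral faces (random here).
smiling = [sample_latent(rng) for _ in range(8)]
neutral = [sample_latent(rng) for _ in range(8)]
smile_dir = attribute_direction(smiling, neutral)
z_edited = shift(z_a, smile_dir, 1.5)
```

Feeding each element of `morph_path` to a generator would give a gradual transition between two faces, while `z_edited` corresponds to a single attribute edit whose strength is controlled by the scalar.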
Traditionally, the generator takes a random noise vector as input. The random noise is fed through a series of up-sampling layers until the desired image is generated. In contrast to the traditional approach, the styleGAN generator has a mapping network, as shown in Figure 1, which takes a random sample $z \in Z$ as the input and transforms it into an intermediate vector $w \in W$. The disentanglement observed in the $W$ space is much stronger than that observed in the $Z$ space. Unlike $Z$, $W$ is not restricted to a specific distribution, and it can better capture the underlying features of the real data. Since the disentanglement in the $W$ space is much superior to that in the $Z$ space, attribute editing works far better in the $W$ space. The distribution of the vector $w$ is not required to be Gaussian; it can be any other distribution. Hence, the actual generator architecture of styleGAN does not start with a random noise vector but with a constant vector. This constant vector is optimized during training. The vector $w$ is plugged into multiple layers of the generative architecture using a blending layer called AdaIN (adaptive instance normalization). During training, noise is also added at these layers. The general principle of a generative model is that its latent space learns the underlying structure of the data. This structure is learnt without supervision during the adversarial training process. To leverage this structure, the image is manipulated in the latent space rather than in pixel space, where manipulation is complicated. To perform this manipulation for a given query image, its latent vector must first be determined. Two different methodologies can be adopted to determine the latent vector. 1) Given that the generator model is a fully differentiable neural network, a random latent code is passed through the generator to produce an image.
The generated image is compared with the query image by calculating the $L_2$ loss, which is the pixel-wise difference between the two images. The gradients are backpropagated through the generator to update the latent vector at the generator's input. By applying gradient descent on the pixel-wise $L_2$ loss, the optimal latent vector is obtained. However, considering the pixel-wise $L_2$ loss alone yields an image that is not very close to the query image, because the optimization may get stuck in local minima. To overcome this issue, a trained classifier is used as a lens to look at the image. Both the generated image and the query image are sent through a VGG network that was trained to classify ImageNet images. Instead of passing through the entire VGG network up to the classification output, the feature vectors are taken from the fully connected layers. These feature vectors give a high-level semantic representation of the facial image content.
2) Random vectors are sampled and passed through the generator to generate faces. With the dataset generated in this way, a ResNet (Residual Network) is trained to predict the latent code of an image. A given query image is passed through the ResNet model, which gives an initial estimate of the latent vector in the styleGAN network. This latent vector is passed through the generator to generate an image. A pre-trained VGG network is used to extract features from the generated image. Similarly, the VGG network is applied to the query image to extract its high-level features. The $L_2$ loss is calculated in the VGG feature space and minimized using gradient descent. The gradients are backpropagated through the generator network to the latent code. During this optimization process, the generator weights are fixed, and only the latent code at the input end is updated. Finally, an optimized image is generated, which is very close to the query image.
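The optimization loop of the second approach can be sketched on a toy problem. In the sketch below, two small fixed linear maps stand in for the frozen styleGAN generator and the pre-trained VGG feature extractor, and a zero vector stands in for the initial estimate produced by the ResNet encoder. All names (`G_WEIGHTS`, `F_WEIGHTS`, `optimize_latent`) and all numeric values are assumptions chosen for illustration; only the structure (fixed generator, $L_2$ loss in feature space, gradient descent on the latent code alone) follows the description above.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


# Toy stand-in for the frozen generator: latent (3) -> "image" (4).
G_WEIGHTS = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -0.5],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
# Toy stand-in for the frozen feature extractor: "image" (4) -> features (3).
F_WEIGHTS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]


def generate(z):
    return matvec(G_WEIGHTS, z)


def features(image):
    return matvec(F_WEIGHTS, image)


def feature_loss(z, target_features):
    """Squared L2 distance in feature space, plus the residual vector."""
    diff = [a - b for a, b in zip(features(generate(z)), target_features)]
    return sum(d * d for d in diff), diff


def latent_gradient(diff):
    """Chain rule for the linear toy: dL/dz = 2 * G^T F^T diff."""
    grad_image = matvec(transpose(F_WEIGHTS), diff)
    grad_z = matvec(transpose(G_WEIGHTS), grad_image)
    return [2.0 * g for g in grad_z]


def optimize_latent(query_image, z_init, lr=0.1, steps=500):
    """Update only the latent code; generator and extractor stay fixed."""
    target = features(query_image)
    z = list(z_init)
    history = []
    for _ in range(steps):
        loss, diff = feature_loss(z, target)
        history.append(loss)
        grad = latent_gradient(diff)
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z, history


Z_TRUE = [0.8, -0.3, 0.5]        # latent code behind the toy query image
query = generate(Z_TRUE)
z_init = [0.0, 0.0, 0.0]         # stand-in for the encoder's first estimate
z_hat, history = optimize_latent(query, z_init)
```

With a real generator and feature network the gradient would come from automatic differentiation rather than the closed-form expression used here, and the loss surface would no longer be convex, which is why a good initial estimate matters in that setting.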
This work adopted the second approach to obtain the latent code. The flowchart for the overall approach is shown in Figure 2. The latent vector is sampled and passed through the generator to obtain the image. A classifier is applied to the generated facial image to extract the attributes. The styleGAN latent space has 512 dimensions. Figure 3 shows the schematic illustration of facial expression recognition for a non-frontal facial image. Given a facial image with expression, the corresponding identity, pose and expressive components are extracted through an encoder. The extracted components are concatenated and sent to the decoder. This is done to distill the expressive component from the facial image in order to classify the expressions. Facial expression recognition is performed in two phases, which are separated by the dotted rectangular box in the schematic illustration. The first phase involves determining the latent vector of the given query image. The second phase involves extracting high-level facial attributes and recognizing the facial expression using an expression recognition neural network. Six basic emotions, namely happy, sad, angry, fear, disgust, and surprise, are recognized. Facial images with no expression are classified as neutral. The drawback of the current work is that the proposed model does not handle images from a noisy environment. Future work may explore facial expression recognition on an unconstrained expression dataset with a noisy environment and explore real-time applications of facial expression recognition. The overall framework involves extracting attributes such as pose, expression and identity. Let $x$ be the input sample, $E$ the encoder and $D$ the decoder. The encoder and decoder are built with multiple convolutional layers to map the attributes into the latent vector and to recover the image from the latent vector, which is represented as
$z = E(x), \quad \hat{x} = D(z)$
where $x$ is the input sample, $\hat{x}$ is the reconstructed image, $E$ is the encoder that encodes the given input into the latent space, $D$ is the decoder that decodes the latent vector back to the original input, and $z$ is the latent space vector.
In terms of latent space, the ultimate goal is to navigate the latent space to achieve transformations of a given image. The generator is formulated to map the latent space to the image space, $G: Z \to I$, with $Z \subseteq \mathbb{R}^n$, where $\mathbb{R}^n$ denotes the n-dimensional latent space. Here, $G$ is the generator, $Z$ is the latent space and $I$ is the image space; $z \in Z$ is a latent space vector and $i \in I$ is a sample in the image space. Figure 4 shows the generative network, where a random sample from a given distribution is passed to the generator $G$. The generator generates the image $i$, and the loss is calculated as the feature-wise difference between the generated distribution and the real distribution.

Facial expression recognition
A deep CNN is used for facial feature extraction and emotion recognition. A fine-tuned VGG19 architecture is used in the model. The VGG19 architecture is fine-tuned to optimize the classification performance of the deep CNN. Dropout is applied between the final convolutional layer and the fully connected layer to avoid over-fitting. The final fully connected layer uses softmax to classify the expressions into one of the seven categories. The output of the softmax activation function is a set of probabilities corresponding to the seven classes, which sum to 1. The cross-entropy loss function is used to handle noisy labels and for faster training. Another advantage of using cross-entropy as a loss function is improved generalization capability. Figure 5 shows the fine-tuned VGG19 network. The class with the maximum output probability is taken as the expression of the facial image, represented by the equation
$\hat{y} = \arg\max_{i \in \{1, \dots, N\}} p_i$
where $p_i$ is the predicted probability of class $i$ and $N$ is the number of classes.
To perform facial expression recognition, the CK+ dataset is used, which was released as an extension of the Cohn-Kanade (CK) dataset [49]. The CK+ dataset has 593 image sequences of 123 subjects. Among the 593 sequences, 327 have emotion labels. The last three frames of each of the 327 labeled sequences are extracted from the dataset, giving 981 facial expression images. The dataset is considered reliable because it was captured in a laboratory environment. Data augmentation is applied to expand the volume of data. 10-fold cross-validation is used for evaluation. Seven different facial expressions, namely happy, sad, fear, surprise, disgust, neutral and angry, are classified. Figure 6 shows sample images from the CK+ dataset displaying the seven different emotions.
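The sample count and the 10-fold protocol can be checked with a short sketch. The function name `k_fold_indices`, the seed and the purely random assignment of images to folds are assumptions; the paper does not state how images were assigned to folds. Because the three frames taken from one sequence are nearly identical, a split by sequence or by subject may be preferable in practice to avoid near-duplicates appearing in both training and test folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding distributes the remainder so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


N_LABELED_SEQUENCES = 327   # labeled CK+ sequences, as stated above
FRAMES_PER_SEQUENCE = 3     # last three frames of each sequence
n_images = N_LABELED_SEQUENCES * FRAMES_PER_SEQUENCE

splits = list(k_fold_indices(n_images, k=10))
fold_sizes = sorted(len(test_idx) for _, test_idx in splits)
```

Each image appears in exactly one test fold, so averaging the per-fold accuracies gives one estimate over the whole dataset.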

Loss function
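Cross-entropy is used as the loss function. It is calculated as
$L = -\sum_{i=1}^{N} y_i \log(p_i)$
where $y_i$ is 1 if class $i$ is the true label and 0 otherwise, and $p_i$ is the predicted probability of class $i$. The outputs of the fully connected layer for the seven classes are normalized by the softmax activation function so that the probabilities $p_i$, $i \in \{1, \dots, N\}$, sum to 1, where $N$ is the number of classes. The softmax activation function is calculated as
$S(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$ (4)
where $S$ is the softmax activation function, $x_i$ is the output of the fully connected layer for class $i$, and $N$ is the number of classes of the multiclass classifier.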
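A small numerical sketch of the softmax activation, the cross-entropy loss for a one-hot label, and the maximum-probability decision is given below. The class ordering in `EXPRESSIONS` and the example logits are made up for illustration and are not taken from the trained model. Subtracting the largest logit before exponentiating is a common numerical safeguard that leaves the softmax output unchanged.

```python
import math

# Assumed class order, for illustration only.
EXPRESSIONS = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index, eps=1e-12):
    """Cross-entropy for a one-hot label: minus the log of the true-class probability."""
    return -math.log(max(probs[true_index], eps))


def predict(probs):
    """Return the expression with the highest probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return EXPRESSIONS[best]


logits = [0.2, -1.0, 0.1, 3.0, 0.5, -0.3, 1.2]   # made-up network outputs
probs = softmax(logits)
label = predict(probs)
loss_if_happy = cross_entropy(probs, EXPRESSIONS.index("happy"))
loss_if_fear = cross_entropy(probs, EXPRESSIONS.index("fear"))
```

The loss is small when the true class receives high probability (`loss_if_happy`) and large otherwise (`loss_if_fear`), which is the signal that drives training.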

Results and discussion
The proposed model is evaluated with a benchmark dataset and real images. The ground truth images are passed into the model to predict the latent code. The expression and identity components are distilled from the ground truth image. The results show that the generated images are very close to the ground truth image. Figure 7 presents the results obtained from the first stage of the proposed model. Figure 7(a) shows the ground truth image passed as input to the encoder to obtain the feature vector.
In this section, real faces are manipulated to analyze the performance of the proposed model on real faces. Figure 8 shows the results of generating facial images from the latent code of the image. The results show that the model can successfully predict the facial expression for real faces.
The model is evaluated using 10-fold cross-validation. A recognition accuracy of 72.38% is achieved on the FER2013 dataset, which is higher than that of other models. The images in this dataset are noisy, with low illumination, blur and occlusion. The recognition accuracy can be further improved by applying denoising techniques to the images, and data augmentation can be performed to increase the number of images. The model achieves high performance compared to models that handle only the frontal view of facial images [38]. The confusion matrix was analyzed to evaluate the performance in determining the facial expression. Figure 11(a) shows the confusion matrix of our model on the CK+ dataset for the different expressions, together with the overall expression recognition accuracy. The average recognition rate is 96.97%. This recognition accuracy indicates that the model can recognize the facial expression regardless of the angle of the head. The confusion matrix shows that the model performs exceptionally well for the happy, sad, surprise, angry and disgust expressions, with very high accuracies.
One hundred percent accuracy is achieved for the expressions that have a good number of samples. It should be noted that classification errors occur in recognizing the fear and neutral emotions. The accuracy for fear and neutral is lower because of the small number of training images for these two expressions. The results suggest that automated models can perform as well as a human observer. Figure 11(b) shows the confusion matrix of our model on the FER2013 Public Test set for the different expressions, together with the overall expression recognition accuracy. Figure 11(c) shows the corresponding confusion matrix on the FER2013 Private Test set. Table 1 compares the facial expression recognition performance with existing methods. The methods listed in the table perform facial expression recognition on the frontal view of facial images, whereas the proposed work takes multi-pose facial images. From the results, it can be observed that the model outperforms the existing state-of-the-art methods for multi-pose facial expression recognition.
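For readers reproducing this kind of analysis, the sketch below shows how a confusion matrix, per-class recognition rates and overall accuracy are computed from true and predicted labels. The label lists are invented three-class data used purely to exercise the functions; they are not results from this work, and the function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Made-up labels for three classes, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
recall = per_class_recall(cm)
accuracy = overall_accuracy(cm)
```

A class with few test samples contributes coarse recall values (one error among three samples already lowers recall to about 0.67), which is consistent with the observation that classes with few training and test images show lower and less stable accuracy.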

Conclusions
This work proposed a model to extract features from the latent representation of the facial image. The given facial image is encoded into feature vectors from which the input ground truth image is recovered. The model recovers an identity- and expression-discriminative representation of the facial image. Experiments were conducted using real images and the CK+ dataset. Compared with existing works, the current work generalizes well and synthesizes visually appealing images that preserve the semantics of the facial image. The results show that the proposed model is capable of extracting the facial expression regardless of the view of the facial image. The proposed model achieved state-of-the-art results with an accuracy of 96.97% on the CK+ dataset. The future work may explore facial expression