Empirical Investigation of Multimodal Sensors in Novel Deep Facial Expression Recognition In-the-Wild

Interest in facial expression recognition (FER) is increasing day by day owing to its practical and potential applications, such as human physiological interaction diagnosis and the detection of mental diseases. The area has received much attention from the research community in recent years and achieved remarkable results; however, significant improvement is still required on spatial problems. This work presents a novel framework and proposes an effective and robust solution for FER in an unconstrained environment. Face detection is performed under the supervision of facial attributes: Faceness-Net derives deep facial part responses to detect faces under severe unconstrained variations. To improve generalization and cope with the insufficient-data regime, a Deep Convolutional Generative Adversarial Network (DC-GAN) is utilized. Because of the challenging environmental factors encountered in the wild, a large amount of noise disrupts feature extraction and makes the ground truth hard to capture. We therefore leverage different multimodal sensors alongside the camera to aid data acquisition, extract features more accurately, and improve the overall performance of FER. These intelligent sensors are used to tackle significant challenges such as illumination variance, subject dependence, and head pose. A dual-enhanced capsule network is used to handle the spatial problem: traditional capsule networks cannot sufficiently extract features when the distance between facial features varies greatly, whereas the proposed network is capable of spatial transformation thanks to an action-unit-aware mechanism and thus forwards the most discriminative features for dynamic routing between capsules. The squashing function is used for classification. We demonstrate the effectiveness of our method by validating the results on four popular and versatile databases, on which it outperforms state-of-the-art methods.


Introduction
Facial expressions carry the most important nonverbal and rich emotional information in social communication [1]. People communicate with each other through verbal and nonverbal communication [2]. Nonverbal communication is conveyed mainly through facial gestures, eye contact, facial expressions, and paralanguage [3]. According to earlier research, facial expressions convey about 50 percent of the information in communication, voice about 40 percent, and language about 8 percent. Moreover, due to the rapid progression of technology, we spend most of our time on electronic devices whose software interfaces are tense, primitive, and nonverbal. Facial expression recognition can therefore help to achieve more natural and intelligent human-machine interaction.
Facial expression recognition is used in various domains, such as Intelligent Tutoring Systems (ITS), psychology, human-machine interaction, behavioral science, intelligent transportation, and interactive games [4]. It can help monitor abnormal expressions in crowds at public places to prevent crime, capture timely customer feedback in the service industry, and support the timely treatment of patients by observing their real-time expressions at the hospital. According to Ekman and Friesen [5], there are six basic expressions: happiness, surprise, disgust, fear, sadness, and anger (some researchers count the neutral expression as a seventh). These expressions are conveyed by almost all species.
Facial expression recognition has been widely studied by various researchers. Despite the available research, robust FER is still an open and challenging task [6,7]. Most recognition algorithms do not consider the interclass variations caused by differences in the facial attributes of the same individual; hence, expression classification is mostly done through facial expression information along with identity-related information [8,9]. The main drawback is that this harms the overall generalization capability of FER systems, degrading performance on unseen identities [10]. An efficient FER system plays a vital role in the treatment of patients by observing their variable behavior patterns. A happiness expression depicts a healthy and positive mental state, while sad and angry expressions indicate an unhealthy one. Different mental diseases such as autism or anxiety can be detected through the emotional conflicts of a particular patient. Most FER systems receive physical signals from a camera, but it is also important to observe physiological signals, which can be captured with the help of various other sensors. An important application of FER is e-health care; nowadays, almost 0.3 billion people suffer from depression, which can lead to suicidal tendencies if they are not treated timely and effectively [11]. In general, mental health treatment faces many barriers, such as financial cost, social stigma, and a shortage of accessible options. Normally, clinical staff interview a patient to check for symptoms of depression via verbal and nonverbal indicators, and patients are asked to fill in a questionnaire to measure depression severity [12]. For timely detection of depression symptoms, an AI-based system can help overcome these barriers to timely and effective treatment.
With the help of multimodal sensors, clinicians can obtain automatic tools for depression screening based on audio, visual, and linguistic signals. Meanwhile, a patient can check his or her mental status with an AI application on a portable camera-equipped device.
In this paper, we use a combination of different techniques to develop a robust model. Initially, we implement different preprocessing techniques to fine-tune and remove highly uncorrelated information in the images. Face detection is performed using facial attributes due to the following reasons.
(1) The human face has a unique structure, in which the most important local facial parts, such as the eyes, mouth, and nose, help us to detect the face in an unconstrained environment; the partness (response) maps of five different parts are therefore used in the method.
(2) Faces exhibit consistent spatial arrangements, e.g., the hair lies above the eyes and the lips below the nose; hence, a faceness score is derived from the response configuration.
(3) Face hypotheses are performed to estimate more accurate face locations.
Our contribution is to introduce special attributes supervision in order to discover facial part responses. We adopt a Deep Convolutional Generative Adversarial Network (DC-GAN) for data augmentation; it provides realistic augmented data and improves generalization performance in the low-data regime. In facial expression recognition, the focus has shifted towards multimodal sensor data acquisition, adding a reasonable number of intelligent sensors to extract features accurately. This works well in lab-controlled environments and health care facilities. We use multimodal smart sensors for illumination variation and pose/head detection, and infrared for the targeted area/personnel. For e-health care systems, we use sensors capturing electrical signals such as the electrocardiogram (ECG) and electroencephalograph (EEG). Adding a large number of sensors to a system is impractical in real-life scenarios, which are most often emergency situations such as flooding, fires, earthquakes, and tsunamis. Our proposed method combines normalization techniques to counter the environmental noise and unwanted data that hinder accurate emotion detection in the wild.
For accurate and robust FER, feature representation of the facial images is the most important step. A considerable amount of research has been done on local and global feature extraction [13]. Fan et al. [14] suggested MRE-CNN, a model which aims to enhance the learning power of the convolutional neural network by considering both local and global features, whereas Li et al. [15] introduced the DLP-CNN framework, in which the discriminative power of deep features is enhanced by maximizing the interclass scatter and preserving locality closeness. Still, these methods are unable to find the relative relationships between the local features. The face is composed of a definite structure in which every part has a relative relationship with the other parts. To address this issue, we propose a method which is capable of spatial transformation due to an action-unit-aware mechanism and thus forwards the most discriminative features for dynamic routing between capsules. Finally, the squashing function is used for classification. We assess the effectiveness and performance of the introduced model on the Extended Cohn-Kanade, MMI, Oulu-CASIA, and Real-world Affective Faces (RAF) databases. Figure 1 shows some sample images from the CK+ database.
The main contributions of this paper are as follows:
(i) For robust face detection, we introduce special attributes supervision in order to discover facial part responses.
(ii) We propose a dual-enhanced capsule network, which is able to extract the effective relationships between features from different local regions. Spatial information is also encoded, owing to knowledge of the probability of an object's existence.
(iii) The proposed method uses an action unit awareness mechanism which captures more effective and robust information (even subtle muscle motions) for dynamic routing between the capsules and, as a result, provides a much better feature representation.
The organization of the remaining sections is as follows. Section 2 reviews the problems with existing methods. Section 3 elaborates our novel architecture and the underlying information. Section 4 comprises the results and analysis. Finally, the last section concludes our research and outlines directions for future work.

Related Work
The main goal of FER is to capture meaningful features which are discriminative and descriptive and invariant to facial variations such as occlusion, illumination, pose, and other identity-related details. There are two main approaches to feature extraction: (1) handcrafted and (2) deep learning-based methods. Nowadays, deep learning methods are achieving remarkable results. Earlier, however, most facial expression recognition was based on handcrafted/human-engineered features such as Histograms of Oriented Gradients (HOG) [16], the n-dimensional scale-invariant feature transform (n-SIFT) [17], and Local Phase Quantization (LPQ) [18]. These methods extract both the global and the detailed information of an individual face. However, the information they obtain covers the overall facial region and ignores expression changes in the local regions containing the eyes, nose, and mouth. Such methods perform quite well in lab-controlled environments where subjects pose expressions under constant illumination, stable eye gaze, and limited head pose movement. Existing handcrafted approaches demonstrate comparatively low recognition accuracy, and considerable effort is needed to manually extract the discriminative features linked to expression changes. For in-the-wild scenarios, deep learning methods for robust facial expression recognition have been implemented [19][20][21][22]. However, deep representations are affected because every facial attribute of a particular subject carries a great number of variations, such as the gender, ethnicity, and age of the person posing the expressions. This has a serious disadvantage: the generalization capability of any model is strongly and negatively affected, so the performance of facial expression recognition degrades on unseen subjects.
Although considerable work has been done in this area to improve FER performance, alleviating the influence of inter-subject variations remains a challenge and an open area of research.
Several techniques have been implemented to reduce intraclass variations and increase interclass differences, thereby increasing the discriminative power of the features extracted for FER in real-time scenarios [23]. The Identity-Aware CNN (IACNN) showed that facial expression recognition performance can be enhanced by reducing the influence of identity-related information with expression-sensitive and identity-sensitive contrastive losses [24]. The island loss has been proposed for extracting effective discriminative features for FER [25]. Moreover, in [26], a person-independent expression representation was learned using residue learning. However, this technique proved computationally costly; in addition, because the same intermediate representation is used to generate neutral images for the same identities, it was also unable to disentangle the expression information from the identity information. Likewise, in [24], the effectiveness of the contrastive loss is heavily affected by the large data expansion caused by compiling the training data in image-pair form [25]. Similarly, in [27], a fixed identity was proposed for the transfer of facial expressions to remove the influence of identity-related information, but the problem persists because the efficiency of FER then depends on the expression transfer procedure. In short, FER based on deep learning methods has outperformed the traditional handcrafted methods; however, a gap remains in deep learning, as very few studies have employed facial depth images as input to deep networks. Compared with existing models, our main goal is to design a network that fully supports decomposition of the facial region, is easy to implement, and is robust.
The shift of facial expression recognition (FER) from laboratory environments to the real world/wild decreases the efficiency of recognizing the correct features from 97 percent to almost 50 percent, owing to environmental challenges arising from changing factors such as illumination, head pose, and subject dependence. This problem is mainly addressed using multimodal sensors, i.e., sensors used alongside the main sensor, the camera. An e-health care system uses electrical devices attached to the subject's body, or electrodes inserted in it, alongside images/video to make emotion detection more accurate and better facilitate its patients [28]. This makes it unfeasible in the field in an emergency situation. Some work achieves accurate results in the wild by improving the methods rather than increasing the number of sensors. Challenging light conditions in the wild are considered a major obstacle to extracting accurate facial features. Liu et al. [29] devised a system to handle the illumination problem using three different classifiers (kernel SVM, logistic regression, and partial least squares), which could be used in the wild. Another work by the same authors [30] shows that this problem can also be handled with their developed method. Recent advancements in sensors and methods, computer vision, speech recognition, deep learning, and related technologies have made emotion detection more accurate and efficient [31,32]. Adding more sensors makes sense in certain usage areas, but for real-world application in the wild it is more feasible to stick to a minimum number of sensors and develop a method that solves most, if not all, of the issues.

Preprocessing. Preprocessing is very important, as it aims to capture the meaningful features and to align and normalize the most needed visual information conveyed by the facial image. Every real-time image is affected by nonlinear facial variations, i.e., varying illumination, differences in contrast between foreground and background, and irrelevant head poses. Therefore, to get the maximum possible semantic meaning out of the features before training the deep neural network, we need to perform some preprocessing. This step eliminates highly uncorrelated data in the image.
3.1.1. Face Detection. Face detection is one of the vital steps in FER because of the excessive background: even images taken from benchmark datasets still contain highly uncorrelated information. Since most datasets contain almost-frontal, high-resolution images, the Viola-Jones algorithm [33] is used in most scenarios.
Faceness-Net is used in this paper [34]. A CNN supervised with facial attributes can detect faces even when more than half of the face region is occluded. In addition, the CNN is capable of detecting faces with large pose variation, e.g., a profile view, without training separate models for different viewpoints. A full image is provided as input to the convolutional neural network to generate the partness map. A partness map is generated for each facial part, such as the eyes, nose, and mouth, and facial attributes are further categorized to distinguish each part from the others; hair, for instance, can be blond, black, wavy, straight, etc. In the next stage, the face proposals are refined, so that the usefulness of facial attributes is exploited to learn optimized and robust face detection. A CNN trained on uncropped images is used to obtain face part detectors without any explicit part supervision. The faceness score is evaluated from the face part responses and the spatial arrangements associated with them. The method is trained on two datasets: CelebA for facial attributes and AFLW for face alignment. After the generation of face proposals, a strong face detector is trained, and it outperforms the other methods.
In Figure 2, the face is divided into five important parts; the eyes, nose, and hair are much more effective than the mouth and beard, which can be partially occluded. Therefore, the combination of facial parts gives a much better result than any individual facial part.
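As a rough illustration of deriving a faceness score from part responses, the sketch below aggregates each partness map's response within the spatial band where that part is expected inside a candidate window. The band fractions, the averaging scheme, and the names (`PART_BANDS`, `faceness_score`) are hypothetical assumptions for illustration, not the actual Faceness-Net formulation.

```python
import numpy as np

# Hypothetical vertical (top, bottom) band, as fractions of the candidate
# box height, where each facial part is expected to respond.
PART_BANDS = {
    "hair":  (0.0, 0.3),
    "eyes":  (0.2, 0.5),
    "nose":  (0.4, 0.7),
    "mouth": (0.6, 0.9),
    "beard": (0.7, 1.0),
}

def faceness_score(partness_maps, box):
    """partness_maps: dict part -> 2D response map (H x W), values in [0, 1].
    box: (y0, x0, y1, x1) candidate face window. Returns a scalar score."""
    y0, x0, y1, x1 = box
    score = 0.0
    for part, (t, b) in PART_BANDS.items():
        band = partness_maps[part][
            y0 + int(t * (y1 - y0)): y0 + int(b * (y1 - y0)), x0:x1]
        if band.size:
            score += band.max()          # strongest response in the band
    return score / len(PART_BANDS)       # average over the five parts

# Toy maps: strong eye and nose responses inside a 40x40 window.
maps = {p: np.zeros((40, 40)) for p in PART_BANDS}
maps["eyes"][10:14, 10:30] = 1.0
maps["nose"][18:24, 18:22] = 1.0
good = faceness_score(maps, (0, 0, 40, 40))
```

A window whose part responses fall in the expected bands scores higher than one with responses in implausible positions, which is the intuition behind scoring candidates by response configuration.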
3.1.2. Data Augmentation. Since a deep neural network is chosen for FER, data augmentation is used to improve results by providing a large amount of training data. It improves the generalization capability of the model, as many publicly available datasets are not large enough to validate results efficiently. Large training data yields a well-trained model.
There are some standard methods of data augmentation, such as skewing, rotating, shifting, changing the color scheme, resizing the image, and adding image noise [22]. To automatically learn augmented data in the low-data setting, we use a Deep Convolutional Generative Adversarial Network (DC-GAN) [35], which alleviates the overfitting problem with on-the-fly data. The input samples are randomly cropped from all four sides, and a horizontal flip is then performed, making the dataset ten times bigger than the original.
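The tenfold crop-and-flip oversampling above can be sketched as follows. The paper crops randomly from all four sides; this sketch uses the common deterministic variant (four corners plus centre, each with its mirror) so the output is reproducible, and the crop size 44 is an assumed example value.

```python
import numpy as np

def augment_ten(img, crop=44):
    """Return 10 variants of `img` (H x W): five crops (four corners plus
    centre) and their horizontal flips, i.e. a 10x oversampling of the image."""
    h, w = img.shape[:2]
    cy, cx = (h - crop) // 2, (w - crop) // 2
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop), (cy, cx)]
    out = []
    for y, x in offsets:
        patch = img[y:y + crop, x:x + crop]
        out.append(patch)
        out.append(patch[:, ::-1])   # horizontal flip of the same crop
    return out

views = augment_ten(np.arange(48 * 48).reshape(48, 48), crop=44)
```

Applied to every training image, this turns an N-image dataset into a 10N-image one without collecting new data.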

Multimodal Sensors.
In FER systems, the camera is the most commonly used sensor for capturing images and videos of a subject. It works well in a lab-controlled environment, but its efficiency drops in extreme in-the-wild real-time scenarios, where it faces three key challenges: illumination variance, subject dependence, and head pose. The solution is to add other dimensions to the feature vector, obtained from other sensors. To address these problems, we leverage the multimodal sensors given below.

Eye Tracker. This is a type of visual sensor used to find new patterns in a particular region of the face. Among the facial parts, the human eye is considered especially important, as it helps us to analyse the mental status of an individual, such as focus, attention, presence, and consciousness. This can be achieved with an eye-tracking sensor that provides information about the exact focus of the eyes, and it assists in improving the efficiency of facial expression recognition.

Nonvisual Sensors.
There are three main types of nonvisual sensors. The first is voice, which is strongly correlated with the body [36]; recognition can be made more robust by the fusion of audio and visual information [29]. We select a one-score linear fusion method to take the optimal features from both the visual information and the audio signals. Next are sensors for different physiological signals. Of the four physiological signals, our primary concern is the electrocardiogram (ECG) and the electroencephalograph (EEG): the former deals with heart signals and the latter with brain signals. Both are acquired as electrical signals through electrodes attached to the skin. Most researchers favour EEG because of the robustness of brain signals. However, people do not always express their emotions through facial movement, and illumination variance also affects the recognition process, so fusion with physiological signals boosts the robustness of the FER system.
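A minimal sketch of one-score linear fusion of audio and visual predictions, assuming each modality outputs per-class probabilities. The mixing weight `w` and the example scores are assumptions; the paper does not specify them.

```python
def fuse_scores(visual, audio, w=0.7):
    """One-score linear fusion: combine per-class scores from the visual
    and audio models with a single mixing weight w, then renormalise so
    the fused scores form a probability distribution."""
    fused = [w * v + (1.0 - w) * a for v, a in zip(visual, audio)]
    s = sum(fused)
    return [f / s for f in fused]

# The visual model favours "happy", the audio model favours "neutral".
visual = [0.6, 0.3, 0.1]   # happy, neutral, sad
audio  = [0.2, 0.7, 0.1]
probs = fuse_scores(visual, audio)
```

With a single scalar weight, the fusion stays cheap and easy to tune, which matches the appeal of a one-score scheme over learning a full joint model.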
The third is the depth camera; depth-image pixel intensities provide more robust features than RGB images. Depth cameras have been used in a variety of applications such as face detection and body motion recognition. Modified local directional pattern features are obtained by sorting the strengths among eight directional per-pixel depth strengths. These sensors also help preserve the privacy of individuals and cope with noise-affected images. Multichannel features are extracted from RGB facial images, which results in robust FER in an unconstrained environment. Figure 3 shows the fusion of multimodal sensors with pure image/video processing.
The overall dual-enhanced capsule network (DE-CapsNet) is shown in Figure 4, where the model is divided into two portions. First, the images are preprocessed to remove the uncorrelated information linked to the facial image. Two modules then perform further processing. In the first part, the box with the purple dashed line is action-unit attention aware and consists of deep convolutional layers for extracting enhanced feature maps; this is termed enhancement module 1. In the second part, the enhanced feature maps are encoded between capsules using dynamic routing, and decoding is done by fully connected layers (the process is shown with green dashed lines). Finally, the squashing function is used for the recognition of facial expressions.

Dual-Enhanced Capsule Network
VGG19 is used in enhancement module 1 because it is robust in object classification while having a simple architecture. Each stage has multiple convolutional layers followed by a max-pooling layer: in the first two stages, each stage has 2 convolutional layers, whereas in the last three stages, each stage has 3 convolutional layers. We do not retain the last 3 layers, since we only need the feature maps.
To obtain the attention map, we use the generation method of Li [37], with appropriate adjustments to the datasets used in our work for extracting the key facial landmarks. Figure 4 shows a facial image with its facial landmarks in blue, along with the corresponding attention map. Action unit (AU) centres are obtained from the key facial points using a scaled distance. The facial images are resized so that the scale is the same across all images; the inner-corner distance is then used as the scaled distance, making the shifting distance as adaptive as possible across images, to locate the AU centres. For each action unit, the 7 pixels on each side of the centre are taken in the experiments, so each AU area is 15 × 15. The points closest to an AU centre are assigned the highest weight, H_w.
The Manhattan distance to the AU centre is denoted m_d. Areas with higher values in the attention map correspond to the active areas of action units in the facial image, and the attention map further enhances them.
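The AU-aware attention map described above might be built as in the following sketch: each AU centre spawns a 15 × 15 active area whose weight decays with the Manhattan distance m_d from the centre, starting from the highest weight H_w at the centre itself. The linear decay rate and the function name are assumptions; the exact weighting in the paper is not spelled out.

```python
import numpy as np

def attention_map(shape, au_centres, radius=7, hw=1.0):
    """Build an attention map of the given (H, W) shape. Each AU centre
    gets a (2*radius+1) x (2*radius+1) = 15x15 active area; the weight
    decays linearly with the Manhattan distance m_d to the centre, from
    hw at the centre down to zero at the edge of the area."""
    amap = np.zeros(shape)
    for cy, cx in au_centres:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = cy + dy, cx + dx
                if 0 <= y < shape[0] and 0 <= x < shape[1]:
                    m_d = abs(dy) + abs(dx)            # Manhattan distance
                    w = hw * max(0.0, 1.0 - m_d / (2 * radius))
                    amap[y, x] = max(amap[y, x], w)    # keep strongest AU
    return amap

# e.g. two inner-eye-corner AU centres on a 56x56 map
amap = attention_map((56, 56), [(20, 18), (20, 38)])
```

Multiplying feature maps by such a map (as done at stages 3 and 4) amplifies responses near AU centres while leaving the rest of the face untouched.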
After the generation of the attention maps, the maps are forwarded to stage 3 and stage 4, as shown in Figure 4. The feature maps generated after the pooling layer of the second stage are multiplied with the attention map of the first stage, in parallel with the convolutional layers of the third stage. The results obtained after the convolution are added element by element and then forwarded to the max-pooling layer of the current stage as input. A similar operation is performed at the fourth stage by jointly combining the convolutional layers with the attention map. The reason for using attention maps is that not all areas are equally important for facial expression recognition.
After enhancement module 1, we obtain 512 × 7 × 7 feature maps. For dynamic routing, the feature maps are then fed between the primary capsule layers and the face capsule layers. Three fully connected layers are used for decoding and reconstructing the facial image. The nonlinear squashing function used for facial expression recognition is defined in Equation (2) as

u_k = (||j_k||^2 / (1 + ||j_k||^2)) · (j_k / ||j_k||), (2)

where k indexes the capsules and u_k and j_k are the output and input vectors, respectively. L_m is the margin loss to be minimized and L_r is the reconstruction loss; both are used for updating the parameters of the network. The total loss is denoted L_t. The loss function expressions are defined in Equations (3)-(5), respectively,
where cc denotes the classification category and I_cc is the indicator function for that category. The upper and lower boundaries are represented by b+ and b−. f represents the original image, and f_c represents the reconstructed image.
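Under the usual capsule-network formulation, the squashing function and the margin/reconstruction losses above can be sketched as follows. The down-weighting factor `lam` for absent classes and the reconstruction weight `alpha` are not given in the text and are assumed here; the function names are illustrative.

```python
import numpy as np

def squash(j_k):
    """Squashing non-linearity: shrinks a capsule's input vector j_k so
    the output u_k has length in [0, 1) while keeping its direction."""
    n2 = np.dot(j_k, j_k)
    return (n2 / (1.0 + n2)) * j_k / (np.sqrt(n2) + 1e-9)

def margin_loss(lengths, target, b_pos=0.9, b_neg=0.1, lam=0.5):
    """L_m over capsule lengths (one per class); `target` is the true
    class index, b+/b- the boundaries from the text, lam an assumed
    down-weighting factor for the absent classes."""
    loss = 0.0
    for cc, v in enumerate(lengths):
        i_cc = 1.0 if cc == target else 0.0
        loss += i_cc * max(0.0, b_pos - v) ** 2 \
              + lam * (1.0 - i_cc) * max(0.0, v - b_neg) ** 2
    return loss

def total_loss(lengths, target, f, f_c, alpha=0.0005):
    """L_t = L_m + alpha * L_r, where L_r is the squared reconstruction
    error between original image f and reconstruction f_c."""
    l_r = float(np.sum((f - f_c) ** 2))
    return margin_loss(lengths, target) + alpha * l_r

u = squash(np.array([3.0, 4.0]))   # ||j_k|| = 5, so ||u_k|| = 25/26
```

The length of each face capsule then directly serves as the class score, which is why the squashing function can double as the classification function.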

Performance Metrics
To evaluate the merits of the proposed method, several performance metrics are used in the quantitative comparisons of different approaches. The methods for comparing recognition rates differ mainly in how the training and test sets are divided. Facial expression recognition is fundamentally a multiclass problem. 10-fold cross-validation is performed, dividing each expression category into a number of training and test sets. This method makes use of all the samples and also effectively avoids overfitting and underfitting problems. A set of evaluation metrics is required to discriminate between classifiers and obtain an optimal one. A few evaluation metrics are implemented in this work to report the recognition rates. The performance metrics are as follows.
4.1. Accuracy. Accuracy is one of the most widely used metrics in classification problems. It is the ratio of correctly predicted samples to all predictions made; the average accuracy is obtained by averaging the accuracies over each expression category. The equation for accuracy is

Accuracy = (t_p + t_n) / (t_p + t_n + f_p + f_n),

where p stands for positive and n for negative predictions, respectively (t for true, f for false). Figure 5 presents the average accuracy rates on the aforementioned databases.

4.2. Precision. The precision of each class is the ratio of true positives to the sum of true positives and false positives. Precision depends strongly on the classifier's decision threshold: raising the threshold tends to increase precision, since fewer borderline samples are predicted positive, whereas lowering the threshold decreases precision by creating more false positives. The equation for precision is

Precision = t_p / (t_p + f_p).

4.3. F1-Score. The F1-score, also called the F-measure or balanced F-score, is the harmonic mean of the precision and recall for each class; precision and recall contribute equally to the F1-score. The equation is

F1 = 2 · (Precision · Recall) / (Precision + Recall),

where Recall is the ratio of true positives to all samples that should have been identified as positive.
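The three metrics can be computed from raw counts as in this sketch, treating one class as positive in a one-vs-rest fashion (function and variable names are illustrative):

```python
def metrics(y_true, y_pred, positive):
    """Per-class accuracy, precision, recall and F1 from raw counts;
    `positive` is the class treated as positive (one-vs-rest)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

y_true = ["happy", "happy", "sad", "sad", "happy"]
y_pred = ["happy", "sad",   "sad", "happy", "happy"]
acc, prec, rec, f1 = metrics(y_true, y_pred, "happy")
```

Averaging these per-class values over all expression categories gives the macro scores usually reported in FER comparisons.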

Results and Discussion
We use four of the most popular databases to report our results: CK+ [38], MMI [39], Oulu-CASIA [40], and RAF [23]. RAF contains real-world expressions with large pose variation, which the first three databases lack, so we use it to check the robustness of our method on large-pose expressions.

Description of Databases. The Extended Cohn-Kanade (CK+) database is the most widely used and most popular database in facial expression recognition. It contains 593 video sequences, which vary from 10 to 60 frames and shift from a neutral expression to another expression. A total of 123 subjects, aged 18 to 30 years, performed the expressions; most of them are female. Of these, 327 video sequences are categorized into seven expressions. The core reason the algorithms are not uniform over CK+ is that it does not provide specific training, validation, and test sets.
The MMI database is laboratory-controlled; 75 subjects performed 2900 expressions as both video sequences and high-resolution static images, of which 326 video sequences were obtained from 32 subjects. The MMI database differs from CK+ in that it includes the onset, apex, and offset phases: in each sequence, a neutral expression is performed at the start, the expression reaches its peak, and it then returns to neutral. The database has very challenging conditions, i.e., it includes large interpersonal variations, with subjects performing different nonuniform expressions while wearing glasses, moustaches, etc. The Oulu-CASIA database consists of 2880 image sequences from 80 subjects for six expressions; most of the subjects are males between 23 and 58 years old. This database is specially designed to tackle the problem of illumination due to environmental changes. It uses two different imaging systems, Near Infrared (NIR) and Visible Light (VIS), and three different illumination scenarios: normal indoor illumination; weak illumination, where only the computer display is on; and dark illumination, with all lights off.
The Real-world Affective Faces Database consists of 29,672 highly diverse real-world facial images, downloaded from the Internet and labelled by crowdsourcing, with 40 annotators independently labelling each image. This database contains large variability in the subjects' gender, age, and ethnicity, as well as in lighting conditions, head pose, eye gaze, occlusions, and postprocessing operations, which helps us validate our network over versatile databases.

Implementation Details.
The facial image is first preprocessed using face detection, data augmentation, and illumination normalization (in our work, a weighted summation combines histogram equalization and linear mapping) to fine-tune the image. Highly uncorrelated data is removed before further processing to obtain a high-quality result. Landmark detection then locates the key facial points. VGG19 is used as the backbone of the network, yielding 512 × 7 × 7 feature maps after the first enhancement module. Next, 256 × 6 × 6 feature maps are obtained from 2 × 2 convolutional kernels with a stride of one; these feature maps are forwarded to the primary capsule layer, which has 32 convolutional channels of 8D capsules. Three routing iterations are then executed between the primary capsule layer and the face capsule layer. Each expression is represented by a 16D capsule, to which all lower capsules forward their information. After three fully connected layers, the squashing function is used for classification. The Adam optimizer is used with a learning rate of 0.0001. The upper and lower boundaries are b+ = 0.9 and b− = 0.1. The batch size is set to 16, and the maximum number of iterations to 300. The whole network is trained end-to-end.
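Three ingredients of this pipeline can be sketched in NumPy: the illumination normalization as a weighted sum of histogram equalization and linear mapping, the capsule squashing non-linearity, and a margin loss using the b+ = 0.9 and b− = 0.1 boundaries. This is an illustrative sketch, not the authors' implementation; the blend weight `w` and the down-weighting factor `lam` are assumed parameters not specified in the text.

```python
import numpy as np

def illum_normalize(img, w=0.5):
    """Blend histogram equalization with a linear (min-max) mapping.
    img: uint8 grayscale image; w is an assumed blend weight."""
    hist, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
    cdf = hist.cumsum()
    heq = 255.0 * cdf[img] / cdf[-1]                   # equalized image
    lo, hi = int(img.min()), int(img.max())
    lin = (img.astype(float) - lo) * 255.0 / max(hi - lo, 1)
    return np.clip(w * heq + (1 - w) * lin, 0, 255).astype(np.uint8)

def squash(s, axis=-1, eps=1e-8):
    """Capsule squashing non-linearity: preserves orientation while
    mapping the vector length into [0, 1)."""
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def margin_loss(lengths, labels, b_pos=0.9, b_neg=0.1, lam=0.5):
    """Margin loss over output-capsule lengths with the upper and lower
    boundaries b+ = 0.9 and b- = 0.1 from the text."""
    pos = labels * np.maximum(0.0, b_pos - lengths) ** 2
    neg = lam * (1.0 - labels) * np.maximum(0.0, lengths - b_neg) ** 2
    return float(np.sum(pos + neg, axis=-1).mean())
```

The squashing function keeps each capsule's direction intact while compressing its length, so the length can be read as the probability that the expression the capsule represents is present.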

Discussion.
For data selection in the Extended Cohn-Kanade database, we take the last three frames of each sequence and treat the first frame as the neutral expression. The subjects are divided into 10 groups, and 10-fold cross-validation is performed. Table 1 compares the average accuracy rates with other existing state-of-the-art methods. Our image-based method achieves the highest accuracy of 98.95 percent, even against sequence-based techniques that extract features from sequences of images or videos. For the MMI database, we take three frames from the middle of each sequence, which carry the peak information, yielding a dataset of 624 images. Data augmentation is then performed, and the data is distributed among 10 sets. For experimentation, 10-fold person-independent cross-validation is performed using the first frame as the neutral expression and the three peak frames from every frontal sequence. Table 2 shows that our average accuracy surpasses the other existing methods.
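The person-independent protocol above partitions subjects, not samples, so that no subject appears in both training and test folds. A minimal sketch of such a split (illustrative only; the authors' exact grouping is not specified, and the `seed` is an assumed parameter):

```python
import random

def subject_folds(samples, n_folds=10, seed=0):
    """samples: list of (subject_id, sample) pairs. Subjects are
    partitioned across folds so each fold is disjoint by subject."""
    subjects = sorted({sid for sid, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    fold_of = {sid: i % n_folds for i, sid in enumerate(subjects)}
    folds = [[] for _ in range(n_folds)]
    for sid, sample in samples:
        folds[fold_of[sid]].append((sid, sample))
    return folds
```

Each cross-validation round then trains on nine folds and tests on the held-out fold, guaranteeing the test subjects are unseen.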
For training and testing on the Oulu-CASIA database, we use the last three frames of every sequence. As with CK+, 10-fold cross-validation is performed, with folds split by subject so that each fold is completely disjoint from the others. Table 3 shows the average accuracy rates; our method outperforms all state-of-the-art methods, achieving the highest accuracy of 91.2 percent.
As with the other databases, we perform 10-fold cross-validation on the RAF database. Table 4 shows the average accuracy rates of our method on RAF. We first obtained the true positives, false positives, true negatives, and false negatives; then, over the 10 folds, we calculated the per-class precision and F1 score. Figure 6(a) shows the per-class precision and Figure 6(b) the per-class F1 score on these databases, while Figure 7 presents classification results on real-time images. Considering the agreement of facial expressions across face angles, we noticed that perceived arousal is higher for the frontal face than when the face angle shifts, whereas happiness, closed-mouth disgust, and surprise remain unaffected when the face is turned away. Furthermore, affective valence near the frontal view is conveyed more by the full left profile than by the full right profile, because the left hemiface shows more spontaneous responses than the right hemiface. Facial motion information can enhance facial expression analysis when the image is subtle or degraded. Dynamic neutral expressions, such as eye blinking or chewing, also pose a challenge. Moreover, dwell time is a key factor: it is longer over the eyes than over the mouth, although for the happy expression the dwell time over the mouth is relatively high. As expression intensity increases, accuracy also increases, whereas dwell time and round-trip time decrease. Overall, the response time of females is faster than that of males, even in a low-intensity environment, and the dwell time of female eyes is longer than that of males.
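The per-class precision and F1 computation from the true/false positive and negative counts can be sketched as follows; this illustrative version assumes the per-fold counts have been accumulated into a confusion matrix:

```python
import numpy as np

def per_class_precision_f1(conf):
    """conf[i, j] = number of samples with true class i predicted as
    class j. Returns (precision, f1) arrays, one entry per class."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp          # predicted as class c, but wrong
    fn = conf.sum(axis=1) - tp          # true class c, but missed
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, f1
```

Averaging these per-class values over the 10 folds gives the quantities plotted in Figures 6(a) and 6(b).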

Potential Applications.
The existing architecture has many practical applications across a wide range of areas [55]. We discuss its feasibility in a hospital where patients' health risks are assessed from their facial expressions. Such applications are feasible because, in a hospital, not all patients are attended at all times. For this approach to be effective, sensors can be installed in different areas of the hospital to preprocess patients' data. Fluctuating facial expressions can be sent to a local sink node, and alarms can be generated based on a specified protocol on a patient-by-patient basis. Another application is the detection of depression among the population in office setups where employees move around frequently. In such wild environments, the head poses of employees can seriously limit the system's ability to evaluate facial expressions. Multimodal sensors can play a vital role in such scenarios by extracting the right kind of attributes for better detection. Other possible applications include schools and colleges, for evaluating a healthy learning environment for youth, and the monitoring of babysitters in day-cares across different organizations.

Multimodal sensors can reduce the overall footprint of the application data by filtering out useless background information. Furthermore, redundant data can be removed at the sensor nodes through local computation. In addition, sensors can be locally powered, so they impose no power constraints on the resource-constrained WSN nodes.
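The local redundancy filtering described above could, for instance, forward a reading to the sink node only when the predicted expression changes or its confidence moves appreciably. A hedged sketch, in which the `min_delta` threshold is an assumed parameter rather than part of the proposed protocol:

```python
def filter_redundant(readings, min_delta=0.15):
    """readings: iterable of (label, confidence) pairs produced at the
    sensor node. Keep a reading only when the label changes or the
    confidence shifts by at least min_delta, suppressing redundant
    transmissions to the sink node."""
    sent = []
    last = None
    for label, conf in readings:
        if last is None or label != last[0] or abs(conf - last[1]) >= min_delta:
            sent.append((label, conf))
            last = (label, conf)
    return sent
```

Filtering at the node keeps the WSN traffic proportional to expression *changes* rather than to the raw frame rate.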

Conclusions
In this paper, we have introduced a robust and effective state-of-the-art architecture. A facial image is first preprocessed using different techniques to counter the problems of excessive background, limited data, varying illumination, pose variation, and occlusion. The fine-tuned facial image is then forwarded to a dual-enhanced capsule network capable of handling spatial transformations. It uses an action-unit-aware mechanism, which locates the active facial areas and thereby aids facial expression recognition. Multiple convolutional layers enhance the feature representation ability, helping to capture the key information present in the particular structure of the face.
Different databases contain different sets of images captured under varying conditions; as a result, class imbalance arises from inconsistencies in expression annotations, so a cost-sensitive layer could be added when training the deep neural networks. A powerful deep neural network could also be designed with prior knowledge of changes in the local environment, capable of predicting specific parameters and of inherently handling and recovering from facial occlusions without any intervention. Furthermore, to improve the robustness of FER, the model can be fused with other models. Incorporating other modalities, such as depth information from three-dimensional face models, infrared images, physiological data, and insights from neuroscience and cognitive science, is a promising direction for future research.

Data Availability
All the data are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.