Emotion recognition using a glasses-type wearable device via multi-channel facial responses

We present a glasses-type wearable device that detects emotions from the human face in an unobtrusive manner. The device is designed to gather multi-channel responses from the user's face naturally and continuously while the user is wearing it. The multi-channel responses include physiological responses of the facial muscles and organs based on electrodermal activity (EDA) and photoplethysmography. We conducted experiments to determine the optimal positions of the EDA sensors on the wearable device because EDA signal quality is very sensitive to the sensing position. In addition to the physiological data, the device can capture the image region representing local facial expressions around the left eye via a built-in camera. In this study, we developed and validated an algorithm that recognizes emotions using the multi-channel responses obtained from the device. The results show that the emotion recognition algorithm using only local facial expressions classifies emotions with an accuracy of 78%. Using the multi-channel data, this accuracy was increased by 10.1%. This unobtrusive wearable system based on multi-channel facial responses is well suited for monitoring a user's emotions in daily life and has considerable potential for use in the healthcare industry.


I. INTRODUCTION
Emotion recognition is a technology for predicting people's emotional states based on user responses such as verbal or facial expressions [1]; it can be applied in various fields, such as healthcare [2], [3], gaming [4], and education [5], [6]. To support these applications, the technology should recognize emotions naturally and in real time while the user is experiencing them. Recently, wearable devices have garnered attention for emotion recognition applications [7].
Most existing wearable devices for emotion recognition have relied on biosignals. Biosignals are electronic signals that indicate physiological responses, such as a pulse or a sweating response. These signals originate from changes in the autonomic nervous system (ANS), the control system that regulates bodily functions. Because the ANS is affected not only by emotion but also by other factors, including cognitive stress [8], [9] and physical activity [10], there is a risk of misjudging a user's emotional state if a wearable device relies on biosignals alone. For instance, many people in workplaces engage in physical activity and experience cognitive stress, both of which affect their biosignals; therefore, using only the biosignals may not be reliable [11]. In this case, it is desirable to use additional modalities to obtain more reliable emotional information.
Facial expression can be used as an additional modality here as it provides important cues for emotion recognition.
This modality has already been used in studies on wearable devices [12]. There are two methods for extracting facial expressions with wearable devices. The first is the sensor-based method, which measures the movements of the facial muscles that reflect emotions [11], [12]. This approach might cause discomfort because the sensors must contact the skin over the facial muscles.
The other is the camera-based method, which captures facial expressions using a camera [13]. This method has an advantage over the sensor-based method because the camera is not attached to the skin. Nevertheless, camera-based methods have not been used frequently in wearable devices because the camera modules were too cumbersome and heavy to wear [14]. However, owing to advancements in technology, camera modules have become small enough to wear comfortably.
Recently, camera modules have been used in commercial wearable devices [15]. In particular, they have primarily been used in head-mounted wearable devices to capture the user's perspective, and there have also been emotion recognition studies using the captured pictures. However, these studies used the pictures to recognize the emotions of other persons, not those of the users themselves [16]. We assume that it is beneficial to use the camera modules of these glasses-type wearable devices to monitor the user's own emotions.
In addition to situational changes, individual differences between users can also affect the use of wearable devices. We consider two cases as examples. In the first case, the user's sweat glands may exhibit low conductivity, or the change in the user's pulse with the emotional state may be small [17], [18]; biosignal-based wearable devices would not be appropriate for such users. Conversely, some people show very little variation in facial expression, and facial-expression-based wearable devices would not be suitable for them. Multi-channel facial wearable devices could widen the customer range by increasing the robustness of these devices to personal differences.
In summary, to check the feasibility of our idea, we propose the following two hypotheses:

1. The local, side-view facial image can be used to monitor a user's emotional state.

2. Combining local facial expressions and biosignals yields a better outcome than using a single channel.

To validate these hypotheses, we developed a glasses-type wearable device that measures multi-channel facial responses: local facial expressions and biosignals. Using the developed device, we conducted emotion recognition experiments with video clips to elicit emotional responses. The experimental results demonstrate that we were able to classify the four emotions with an accuracy of 76.09%, which is higher than when a single channel is used.
Our study proposes a new wearable system that recognizes a user's emotions using multi-channel responses, including facial expressions. The proposed system provides more accurate, stable, and natural emotion recognition for healthcare applications, for instance, monitoring patients with mental disorders [19], [20], [21].
A preprint of this paper is available at https://arxiv.org/abs/1905.05360 [22].

A. Related Works
Table 1 lists the related studies. We mainly compared studies that used glasses-type wearable devices or conducted experiments to recognize induced emotional states. The majority of studies that used glasses-type wearable devices conducted recognition experiments based on forced expressions, i.e., the user deliberately made facial expressions. Such experiments lack natural expressions from the user. The most natural way to obtain the emotional state is to record the user's response while they are performing some task, such as watching a video. Therefore, we designed an experiment to induce emotional states and captured the naturally expressed facial expressions. Although such experiments have been used in studies on biosignal-based wearable devices [24], [26], they have not been used in studies on facial-expression-based wearable devices. Therefore, our study can help researchers who aim to study facial-expression-based wearable devices for emotion recognition in situations with high naturalness. Furthermore, our experiment was conducted with a comparatively large number of subjects. Because emotional response patterns depend strongly on individual differences [14], a large number of subjects is important for generalizing the results. In this respect, our study used a much larger number of subjects than other studies on wearable devices, except for one large-scale study [24]. This number of subjects supports the reliability of our results.
We have published the collected data online; the dataset is available at https://neurocomputing.wixsite.com/nclab/dataset.

The multi-channel wearable device for emotion recognition was designed to extract facial expressions and biosignals. To acquire measurements easily, the device was designed in the form of glasses, similar to Google Glass [15] or the prototype sunglasses for emotion recognition presented at CES 2017 [27]. An Internet protocol (IP) camera module was attached to the left side of the device to capture the local facial expressions around the left eye (from the eyebrow to the cheek). To measure the electrodermal activity in response to emotional state changes, electrodermal activity (EDA) sensors were attached to the surface of the skin in contact with the nose and mastoid. To perform photoplethysmography (PPG), ear-clip-type SpO2 sensors were attached to the earlobes, as has frequently been done in previous studies [28], [29].

A. Hardware Implementation
We used wireless communication to transfer the acquired data. Two different wireless communication protocols were used, one for each modality, to prevent wireless interference: the facial expression images were transferred via Wi-Fi, and the biosignals were transferred via Bluetooth. All the attached devices were powered by a custom rechargeable Li-polymer battery with a capacity of 3000 mAh; the devices could operate continuously for about 2 h. The device was designed so that the battery can be swapped out when it runs out.

The positions of the EDA sensors were determined more carefully than those of the other sensors. Sweat glands, which are the sources of the EDA signal, are distributed with different densities over the body [30]. Therefore, it is necessary to find the best locations for the EDA sensors to contact. Two facial parts, the nose and the mastoid, were selected by analyzing skin movement during facial expressions with a 3D camera [31], considering natural contact with the device and little movement caused by facial expressions. We then prepared three candidate positions by combining these two parts. The first position is on both sides of the nose; this position is contacted by the nose-supporting part of the glasses. The second is on both sides of the mastoid, just behind the ears; this position is contacted by the earpieces of the glasses. The last position is the combination of the nose and the mastoid. The EDA signal measurements were analyzed to compare these positions in the experiments. The experiments were performed with 10 subjects. The subjects sat on a chair 90 cm away from a 19-inch LCD monitor. EDA sensors were placed at the three candidate positions and on the left index and middle fingers, and the subjects were requested to watch video clips for approximately 90 minutes. The EDA signals were measured simultaneously from all the positions while the subjects watched the movie clips. All the procedures in the experiment were approved by the Institutional Review Board (KIST-2015-005).

We prepared a video-clip-based stimulus to induce emotions in the subjects. We intended to induce four quadrant emotional states [32]. The target emotional states are high arousal with high valence (HAHV), high arousal with low valence (HALV), low arousal with low valence (LALV), and low arousal with high valence (LAHV), which correspond to the quadrants of the arousal-valence plane [33].
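As a rough illustration of how the candidate EDA placements described above can be compared with the finger reference, the sketch below correlates each candidate-position trace with the simultaneously recorded finger signal. This is a minimal sketch under assumed data formats; the function name, the resampling step, and the use of the Pearson correlation as the comparison metric are our assumptions, not the authors' published code.

```python
import numpy as np
from scipy.signal import resample
from scipy.stats import pearsonr

def placement_correlation(candidate_eda, finger_eda):
    """Correlate one candidate-position EDA trace with the finger reference.

    Both inputs are 1-D arrays covering the same recording; the candidate
    trace is resampled to the reference length before correlating.
    """
    candidate = resample(np.asarray(candidate_eda, dtype=float), len(finger_eda))
    r, _ = pearsonr(candidate, np.asarray(finger_eda, dtype=float))
    return r

# Hypothetical usage for a single subject, one trace per candidate position:
# positions = {"nose": eda_nose, "mastoid": eda_mastoid, "nose+mastoid": eda_both}
# scores = {name: placement_correlation(sig, eda_finger) for name, sig in positions.items()}
```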

D. Video Stimulus Selection by User Tagging
Two-minute movie clips were used as stimuli to induce emotions. All the clips were extracted from Korean movies because, in the pilot study, the language was found to be important for inducing the HAHV state. The movie clips were carefully selected to induce only one emotion each, to prevent mixing with other emotions. Ten movie clips were selected for each emotion category. Next, we recruited sixty subjects for an online survey, and each subject rated the degree of emotion of each clip to determine whether the stimulus induced strong emotions. During the survey, each subject watched each clip and gave it arousal and valence scores between 1 and 9. Fig. 5 and Fig. 6 show the questionnaires presented after each emotional video. For subjects unfamiliar with the concepts of arousal and valence, Self-Assessment Manikin (SAM) pictures were provided. After the survey, the clips were sorted according to their distance from the origin, and the two clips closest to the origin were excluded for each quadrant. As a result, a total of 32 video clips were selected.

Fig. 7 shows the distribution of the SAM scores obtained from the online survey. Each circle represents a video clip; the position of the circle indicates the average SAM score of the clip, and the width and height of the circle represent the standard deviations of the valence and arousal scores, respectively. The distance from the center (5, 5) was measured. The average distance for the HAHV videos was 2.52 with a standard deviation of 0.48. The average distances for the HALV, LALV, and LAHV videos were 3.56 (standard deviation: 0.51), 2.48 (standard deviation: 0.37), and 2.56 (standard deviation: 0.42), respectively.

The emotion-inducing experiments were conducted using the selected stimuli. The experiment was organized into 32 trials. In each trial, the stimulus clip was shown for 2 min and a neutral clip was shown for 30 s. The neutral clips were shown to neutralize the emotional state between trials. The order of the induced emotions was counterbalanced across trials to avoid label bias. E-Prime 2.0 [34] was used to present the stimuli.
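The clip-screening rule described above can be summarized in a short sketch. This is a minimal illustration under an assumed input format (per-clip mean valence and arousal scores); it is not the survey-processing code used in the study.

```python
import numpy as np

def select_clips(sam_scores, n_drop=2):
    """Rank the clips of one quadrant by their distance from the SAM origin (5, 5)
    and drop the n_drop clips closest to the origin (weakest emotion induction).

    sam_scores: list of (valence_mean, arousal_mean) pairs, one per clip.
    Returns the indices of the retained clips.
    """
    scores = np.asarray(sam_scores, dtype=float)
    dist = np.linalg.norm(scores - np.array([5.0, 5.0]), axis=1)
    order = np.argsort(dist)              # ascending: weakest induction first
    return sorted(order[n_drop:].tolist())

# Applying this to the 10 candidate clips of each quadrant keeps 8 clips per
# quadrant, i.e., 32 clips in total, matching the selection described above.
```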

E. Experiment to acquire induced emotional response
Experiments were conducted using the stimulus clips. A total of 24 subjects participated in the experiment. The data from four subjects were excluded owing to technical errors during data acquisition. A 1.7 m × 1.9 m × 3.0 m shielded room was used for the experiment. The room contained a 19-inch monitor and a two-way speaker. The distance between the subject's head and the monitor was 1 m. To acquire data, the wearable device was worn by the subject, and the physiological signals and facial expressions were recorded while the subject watched the clips. The biosignal data were recorded at 180 Hz and transferred via Bluetooth. The facial expressions were captured at 5 Hz and transferred via Wi-Fi. The data-acquisition procedure was programmed in MATLAB 2017a. Additionally, to compare the emotion recognition performance with a reference, we attached EDA and PPG sensors to the subject's left hand. A Biosemi ActiveTwo (Biosemi Inc.) was used to acquire the EDA and PPG signals from the subject's fingers. The sampling rate of the reference EDA and PPG signals was 512 Hz.
The subjects were between 21 and 35 years old, with an average age of 26.7 years. The experimental protocol was carefully explained to the subjects upon arrival. The experiment consisted of 32 trials. In each trial, a 30-s-long neutral video clip was shown to neutralize the subject's previous emotional state. Then, a 2-min-long emotional video clip was shown, followed by a 2-s-long black screen. The subjects were requested to rest for 5 min after completing 16 trials to minimize the effect of stress on the physiological signals. The trials took 1 h 32 min in total, but the entire process, including preparation, took about 2 h. All the procedures in the experiment were approved by the Institutional Review Board (KIST-2016-013).

F. Data Processing
The acquired raw data were processed in a traditional feature-based machine-learning manner. Emotion-related features were extracted from each raw data channel. The features were extracted from the 120-s-long biosignal of each trial. The observation length for feature extraction was 100 s, and the features were extracted by moving the observation window in increments of 5 s. Therefore, five feature vectors were acquired from each trial and, in total, 160 feature vectors were acquired for each subject. Two types of features were extracted: statistical features and domain-related features.
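The windowing scheme just described can be expressed compactly as follows. This is a minimal sketch; the function name and the list-based representation are our assumptions for illustration.

```python
def sliding_windows(signal, fs, win_s=100, step_s=5):
    """Split a single-trial signal into overlapping observation windows.

    For a 120-s trial, a 100-s window moved in 5-s increments yields
    five windows per trial, i.e., 160 windows over 32 trials per subject.
    """
    win, step = int(win_s * fs), int(step_s * fs)
    return [signal[start:start + win]
            for start in range(0, len(signal) - win + 1, step)]

# e.g., 120 s of a biosignal sampled at 180 Hz -> 5 windows of 100 s each
# windows = sliding_windows(biosignal_trial, fs=180)
```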
The statistical features were extracted from the raw signal regardless of the acquisition channel. The commonly used statistical features extracted from the raw signal are: (1) the average value, (2) the standard deviation, (3) the mean of the absolute values of the first differences, (4) the mean of the absolute values of the second differences, (5) the ratio of the mean of the absolute values of the first differences to the standard deviation, and (6) the ratio of the mean of the absolute values of the second differences to the standard deviation.

Table 2 shows the domain-related features, which were processed differently for each channel. For the PPG signal, the average acceleration of the pulse [14] and heart rate variability (HRV)-related features were extracted. The HRV features are extracted mainly from the peak-to-peak (PP) intervals and the power spectral density (PSD) of the PP intervals. Fig. 9 shows the PPG signal with the PP intervals and the corresponding PSD. For the EDA signal, most features are related to the skin conductance response (SCR) [35]. Two low-pass filters were used to acquire the SCRs at different time resolutions. First, the raw signal was passed through a low-pass filter with a 0.2-Hz cut-off frequency; the resulting signal is called the skin conductance slow response (SCSR). Second, a low-pass filter with a 0.08-Hz cut-off frequency was applied; this signal is called the skin conductance very slow response (SCVSR). The SCR-related features were extracted from each pre-processed signal: the number of SCRs in the SCSR and the SCVSR, the average amplitude of the SCRs in the SCSR and the SCVSR, and the recovery time of the SCRs in the SCSR.
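The six statistical features and the SCSR/SCVSR pre-processing can be sketched as follows. This is an illustrative implementation under our own assumptions (e.g., a zero-phase Butterworth filter of order 4); the paper does not specify the filter type or order.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def statistical_features(x):
    """Compute the six channel-agnostic statistical features listed above."""
    x = np.asarray(x, dtype=float)
    d1 = np.abs(np.diff(x, n=1))   # absolute first differences
    d2 = np.abs(np.diff(x, n=2))   # absolute second differences
    sd = np.std(x)
    return np.array([
        np.mean(x),            # (1) average value
        sd,                    # (2) standard deviation
        np.mean(d1),           # (3) mean of absolute 1st differences
        np.mean(d2),           # (4) mean of absolute 2nd differences
        np.mean(d1) / sd,      # (5) ratio of (3) to the standard deviation
        np.mean(d2) / sd,      # (6) ratio of (4) to the standard deviation
    ])

def low_pass(x, cutoff_hz, fs, order=4):
    """Zero-phase Butterworth low-pass filter (filter type and order are assumptions)."""
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    return filtfilt(b, a, x)

# SCSR / SCVSR pre-processing of a raw EDA window sampled at 180 Hz:
# scsr  = low_pass(eda_window, cutoff_hz=0.2,  fs=180)
# scvsr = low_pass(eda_window, cutoff_hz=0.08, fs=180)
```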
After extracting the biosignal features, a feature-selection method was applied to filter out features that did not vary between emotional states. The selection algorithm used was ReliefF [36].

To process the facial expression response, we used the Fisherface method [37]. We chose this method because it does not require facial landmarks, which cannot be acquired from our local facial expression images. The labels were determined according to the clip the subject watched, regardless of whether the subject actually expressed the emotion. First, principal component analysis (PCA) was applied to reduce the dimensionality of the captured images. The principal components and eigenvalues were extracted from the training dataset and sorted by the magnitude of the eigenvalues. We rejected the eigenvectors with the two largest eigenvalues because the components with large eigenvalues typically describe illumination changes [37], not expression changes. Additionally, we rejected components with small eigenvalues to remove noise: the smallest eigenvectors, whose eigenvalues sum to 15% of the total, were rejected.
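The component-selection rule can be sketched as follows, assuming the eigenvalues are sorted in descending order with the eigenvectors stored column-wise in the same order; the function name and interface are illustrative only.

```python
import numpy as np

def select_pca_components(eigenvalues, eigenvectors, n_drop_top=2, tail_energy=0.15):
    """Select the PCA components kept for the Fisherface projection.

    The two leading components (typically dominated by illumination changes)
    are rejected, as are the trailing components whose eigenvalues together
    account for `tail_energy` (15%) of the total eigenvalue sum.
    """
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    total = eigenvalues.sum()
    # Count how many of the smallest components fit within the tail-energy budget.
    tail_cum = np.cumsum(eigenvalues[::-1])
    n_drop_tail = int(np.searchsorted(tail_cum, tail_energy * total))
    keep = slice(n_drop_top, len(eigenvalues) - n_drop_tail)
    return eigenvectors[:, keep], eigenvalues[keep]
```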
After obtaining the refined eigenvectors, the pixel values of the captured images were projected onto the selected components. The projected values were analyzed using linear discriminant analysis (LDA) to extract features that are more discriminative with respect to the user's emotional state. A weight matrix W was obtained from the LDA, which maximizes the ratio of the between-class scatter to the within-class scatter of the projected data [37]:

W_opt = argmax_W ( |W^T S_B W| / |W^T S_W W| ),

where S_B and S_W denote the between-class and within-class scatter matrices, respectively. The extracted feature vectors were used to train the emotion recognition model. We used a support vector machine (SVM) with an RBF kernel as the emotion recognition model. The kernel size of the SVM was 0.5 for the facial expressions and 0.2 for the other modalities.
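A minimal sketch of this projection-then-classification step is given below, using scikit-learn. Mapping the paper's "kernel size" to the RBF gamma parameter is our assumption; the original implementation may parameterize the kernel differently.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def train_fisherface_svm(pca_scores, labels, kernel_size=0.5):
    """Project PCA scores with LDA (the Fisherface step) and train an RBF SVM.

    pca_scores: (n_samples, n_components) image projections onto the kept
    eigenvectors; labels: emotion class per sample.
    """
    lda = LinearDiscriminantAnalysis()
    fisher_scores = lda.fit_transform(pca_scores, labels)
    clf = SVC(kernel="rbf", gamma=kernel_size)  # 0.5 for facial expressions, 0.2 otherwise
    clf.fit(fisher_scores, labels)
    return lda, clf
```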
For recognizing the arousal and valence states, we used binary RBF SVM models. For recognizing the quadrant emotional states, we implemented a two-level classification model, because combining binary classifiers at multiple levels can be a better solution for recognizing quadrant emotional states, which can be decomposed into arousal and valence states [38]. Therefore, we first recognized the user's arousal and valence states separately and then combined the two to estimate the user's quadrant emotional state.

Our wearable device acquires the user's emotional response from biosignals and facial expressions concurrently. To recognize the emotional state from these two sources of information, we fused the acquired biosignal and facial expression features. The most critical challenge in combining these features was that the sampling rates of the channels differed: the built-in camera module captures 500 images while one biosignal feature vector is extracted from a 100-s-long biosignal window. Therefore, we attempted to extract a representative expression image from the sequential images. However, simple averaging is not effective because there are few meaningful expressions in the sequence of expression images, as the emotional expressions occur within a short period of time. In particular, the initial part of the sequence contains few emotional expressions because we designed the stimulus video clips to induce the user's emotional state gradually. Therefore, we designed an algorithm to extract meaningful expression data from the sequence of facial images with respect to the biosignal features. Algorithm 1 describes the extraction of the meaningful expression data. We first divided the user's eigenface scores into two groups using the k-means clustering algorithm. Then, we computed the group memberships of the frames in the initial 30 seconds to identify the group containing the emotional expressions. Finally, the representative fisherface was obtained by averaging the emotional expression group. The obtained fisherface score was used as the matching facial expression feature.
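A simplified interpretation of Algorithm 1 is sketched below; the function name, the fall-back when no emotional frames are found, and other details are our assumptions rather than the exact published procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_expression(frame_scores, fps=5, baseline_s=30):
    """Extract a representative expression score from one observation window.

    frame_scores: (n_frames, n_features) per-frame eigenface/fisherface scores
    captured at `fps` frames per second. The frames are split into two clusters;
    the cluster that dominates the first `baseline_s` seconds (which contain few
    emotional expressions) is treated as the neutral group, and the other
    cluster is averaged to obtain the representative expression.
    """
    frame_scores = np.asarray(frame_scores, dtype=float)
    labels = KMeans(n_clusters=2, n_init=10).fit(frame_scores).labels_
    initial = labels[: int(baseline_s * fps)]
    neutral_group = np.bincount(initial, minlength=2).argmax()
    emotional = frame_scores[labels != neutral_group]
    if len(emotional) == 0:                 # fall back if every frame looks neutral
        emotional = frame_scores
    return emotional.mean(axis=0)
```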

G. Feature fusion and classification
Finally, we validated the feasibility of our wearable system by comparing the emotion recognition performance. Four validations were used to measure the performance. The first used the reference biosignals, i.e., the PPG and GSR signals acquired from the user's left hand by the Biosemi ActiveTwo. The second and third used the wearable biosignals and the facial expressions independently. The last used the feature-fused modality.
We validated each channel in a leave-one-trial-out manner, unlike [39], which sampled the training and test sets with a random 10-fold split. A random 10-fold cross-validation split has the problem that the training and test data can be close in time, and this temporal similarity could lead to a high correlation between the training and test datasets. Therefore, to prevent this temporal-similarity problem, we used the data of one trial as the test dataset, used the remaining trials for training, and repeated this validation for all trials.
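This evaluation scheme corresponds to grouping the feature vectors by trial and leaving one group out at a time; a minimal sketch using scikit-learn is shown below (the helper and its interface are illustrative assumptions).

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_trial_out(features, labels, trial_ids, make_model):
    """Evaluate a classifier with leave-one-trial-out cross-validation.

    All windows of one trial form the test set and the windows of the
    remaining trials form the training set, so temporally adjacent windows
    never appear in both sets. `trial_ids` assigns each feature vector to
    its trial; `make_model` returns a fresh, unfitted classifier.
    """
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(features, labels, trial_ids):
        model = make_model()
        model.fit(features[train_idx], labels[train_idx])
        accuracies.append(model.score(features[test_idx], labels[test_idx]))
    return float(np.mean(accuracies))
```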

FIGURE 11 Correlation coefficients between fingers and candidate facial sensor placements
As described in the Methods section, we measured the correlation of the EDA signals between the facial locations and the reference location (the fingers). The results of the EDA measurement experiments show that the highest correlation with the fingers was obtained when the sensors were placed near the nose and mastoid. The average correlation for the nose-and-mastoid combination is 0.6757, which is higher than those of the other locations, and the standard deviation across subjects is 0.1663, which is lower than those of the other locations. This result explains why the EDA electrodes were placed on the mastoid and nose contact parts of the glasses.

FIGURE 12 The emotion recognition rates for arousal state estimation for each subject. The dashed black line indicates random chance probability.

FIGURE 13
The emotion recognition rates for valence state estimation for each subject. The dashed black line indicates random chance probability.

Table 3 presents the average accuracies for each channel and for the fusion method. Although the recognition performance of the fusion method is not always better than that of the other methods, its average score was better than those of the other wearable modalities. Fig. 15 shows samples of the acquired fisherfaces. In the acquired facial expressions, the eyebrow and upper cheek regions appear highly activated compared with the other regions. The fisherface for the valence state shows higher activation along the laughter line near the mouth than the fisherface for the arousal state.

Table 5 Average emotion recognition accuracies for each gender group. The number inside parentheses indicates the number of target emotional states.

FIGURE 14 The emotion recognition rates for quadrant state estimation for each subject. The dashed black line indicates random chance probability
Table 5 shows a comparison of the average emotion recognition rates between the male (n=10) and female (n=10) participants. The results indicate that the female participants show better emotion recognition rates for facial expressions, which is consistent with the results in [40], implying that women use facial expressions more frequently than men; the system can recognize emotional states more easily from more frequent facial expressions. Conversely, the male participants show better emotion recognition rates for the facial biosignals than the female participants.

III. Discussion
Existing wearable devices for emotion recognition may not yield good and stable results owing to dynamic situations in real life and individual differences in emotional responses. To solve this problem, this study proposed a new wearable system that enhances emotion recognition performance by using physiological signals and facial images simultaneously.
A multimodal wearable device has the strength that it can be applied to various situations in real life. Besides the improvement in recognition accuracy, using two modalities has further advantages. For example, in situations where the facial expression is not revealed effectively, such as under large illumination changes, the biosignals can compensate for the facial expressions; conversely, the facial expressions can compensate for the biosignals under conditions in which the biosignals do not work effectively, such as when the user is under cognitive stress. Additionally, the results in Table 5 indicate that the multimodal approach could be robust to gender bias in emotion recognition. Table 5 shows that the biosignal-based emotion recognition rates were higher for male users, whereas the facial-expression-based emotion recognition rates were higher for female users. The proposed wearable device could be a reasonable approach to compensating for gender-based bias because it uses both modalities, expanding the range of advantageous features that can be selectively used according to gender.
We designed the glasses-type wearable device to obtain the advantages of a multi-channel wearable device. To find the optimal facial locations for measuring the biosignals, we conducted EDA signal acquisition experiments comparing the correlation of the EDA signals at the facial locations with those at the fingers. The results indicate that it is better to place the EDA sensors on both the nose and the mastoid than on the nose only or the mastoid only. Higher correlations might be obtained if the EDA sensors were placed at other locations, such as the forehead [30]. However, locations other than the nose and mastoid could yield noisy EDA signals owing to skin movement caused by facial expressions, as examined in a preliminary study [31]. In this study, we found the best EDA sensor locations among the locations that can reasonably be used in facial wearable devices.
According to the EDA correlation experiment results, the wearable device was designed to measure the sweat response from the mastoid and nasal skin. The local facial expression was acquired by a built-in camera on the side of the user's face. The initial motivation of our study was the compensation effect between the channels. Unfortunately, this compensation effect was not fully covered by the experiment because we did not control changes in the user's situation, such as applying stress [8]; in future work, we will conduct experiments to prove that the multimodal wearable device provides robust emotion recognition performance in various situations. Although the experiment did not cover the advantages of the fusion modality under situational changes, it validated the benefits of the fusion modality with respect to personal differences. We believe that the experimental results shown in Figs. 12-14 agree with the main hypothesis of our study. As shown in the figures, a single modality is not always suitable for all subjects. For example, subject 19 shows poor recognition performance using the biosignals, whereas subject 24 shows poor recognition performance using the facial expressions; in each case, the other channel compensated for the poor emotional response. Table 3 shows the overall accuracies. Although this compensatory effect of the fusion modality is not seen in all subjects, it is clear in the overall average performance. The fusion of the two modalities shows better performance than either single modality: the classification performance increased by 8.46% compared with the method using the biosignals alone. Additionally, as we observed during the experiment, there were individual differences: some subjects remained mostly expressionless, and some subjects rarely presented EDA responses during the emotion-inducing experiment.
The confusion matrix in Table 4 shows that using a combination of biosignals and facial expressions reduces the confusion among the negative emotional states (i.e., disgust and depressed). For example, cases in which the disgust state is predicted as the funny state, or the depressed state as the calm state, are clearly fewer when the fusion of modalities is used than with the single modalities, while the numbers of the other confusion cases lie between those of the two single modalities. This may indicate that the fusion of the two facial modalities has a particular synergistic effect for detecting negative emotional states. Therefore, the fusion of the two channels could lead to more accurate monitoring of negative emotional states for healthcare, such as monitoring a depressed state [19]. Fig. 16 shows an example application of our emotion recognition system for mental healthcare. In this application, the ratios of the quadrant emotional states are logged in order to monitor subtle and complicated emotional changes in the real world. The user can observe daily emotional state patterns from the logged emotional states and obtain more detailed assistance from mental healthcare services.
Compared with other emotion recognition studies based on facial expressions, the system proposed herein shows lower emotion recognition performance [23], [41], [42]. However, most comparable studies on wearable devices classified data created by intentional facial expressions rather than data created by eliciting actual emotions. Therefore, we suggest that the lower accuracy is due to the differences in the experimental paradigms. Compared with other studies that recognized induced emotional states, our study shows higher emotion recognition rates for valence and arousal state classification [24].
In contrast, existing multimodal emotion recognition studies [32], [35] measured biosignals or facial expressions at standard measurement locations; for example, the sweat response is measured as the EDA response of the finger, and the facial expression is captured as a frontal view of the full face. Our study, on the other hand, measured the user's responses in areas confined to the face, such as the facial expressions around the eye or the sweat response of the nasal skin. Therefore, the modalities used in our experiment differ from those considered in existing multimodal emotion recognition studies. Our study aims to support multimodal emotion recognition researchers and facial wearable device developers in exploring modalities that are more applicable in the real world.

IV. Conclusion
In this study, to increase the reliability of emotion recognition, we proposed a glasses-type wearable device that measures local facial expressions in addition to biosignals. The facial expressions were acquired in an unobtrusive manner using a camera, and the locations of the biosensors on the wearable device were determined via signal measurement experiments. Experiments using video clips were conducted to evaluate the performance of the device. Our results show that the glasses-type wearable device can be used to estimate a user's emotional state accurately.
A limitation of our study is that there was no scale to measure the amount of emotion induced by the stimulation. Although we carefully selected the stimuli to induce exactly one emotional state each, the absence of such a scale might increase the ambiguity of the labeled data. We recommend that a dominance scale [33] be adopted in the experiment, which will help develop more accurate emotion recognition models in future studies. The results of our study can be of value to other researchers studying multimodal emotion recognition with wearable devices.
Additionally, the proposed method can expand the range of healthcare applications of wearable devices, such as monitoring both physical and mental health states. In the future, we expect to develop a wearable device capable of monitoring a user's emotions across various tasks in everyday life.