Micro-Expression Recognition Based on Spatio–Temporal Capsule Network

Micro-expressions (MEs) are subtle, spontaneous facial movements that constitute an essential psychological stress response. Because MEs reflect psychological states accurately and cannot be consciously controlled, they have crucial applications in many psychology-related fields. However, owing to the small volume and high redundancy of available data, existing micro-expression recognition (MER) methods cannot balance accuracy and recognition speed. We therefore propose a deep learning algorithm based on a spatio–temporal capsule network (STCP-Net), which reduces recognition time while maintaining accuracy. STCP-Net consists of four parts: a jitter-removal module, a differential feature extraction module, a spatio–temporal capsule (STCP) module, and a fully connected layer. The first two modules extract diversified differential features accurately and reduce the impact of head jitter. The STCP module extracts spatio–temporal features progressively, layer by layer, fully modeling the temporal and spatial relationships between features. Finally, we conduct extensive experiments on standard datasets under the Leave-One-Subject-Out (LOSO) cross-validation protocol. The results and analysis show that the algorithm is effective and competitive with the state of the art.


I. INTRODUCTION
Human facial expressions are one of the most important means of expressing emotion and a direct channel for communicating emotion to others. Facial expressions can be subdivided into macro-expressions and MEs according to their temporal and spatial scales. Macro-expressions last between 0.75 and 2 seconds, and the amplitude of the facial muscle movements varies with the intensity of the emotion. MEs last between 0.04 and 0.2 seconds, and the facial muscle movements are typically very slight [1], [2].
Compared with macro-expressions, MEs are importantly spontaneous: the expresser involuntarily reveals his or her hidden genuine emotions to the outside world. Owing to this property, MER has received significant attention in health care, criminal investigation, and national security [3], [4]. However, it is difficult for humans to detect and recognize MEs without special training, so Ekman developed ME training tools [5] to help people learn to recognize MEs. Even after such training, recognition accuracy and efficiency remain unsatisfactory.
The development of computer technology has led researchers into the era of recognizing MEs with algorithms. For example, local binary patterns from three orthogonal planes (LBP-TOP) [6], [7], proposed by Zhao et al., and 3D gradient descriptors [8], proposed by Polikovsky et al., achieved good results in the early stage of research. To support the healthy and orderly development of MER, researchers have continuously produced standard datasets, such as the Chinese Academy of Sciences Micro-Expression database (CASME) [9], [10], [11], [12], the Spontaneous Micro-expression database (SMIC) [13], and the Spontaneous Micro-facial Movement dataset (SAMM) [14]. Furthermore, five micro-expression recognition grand challenges (MEGC) have been successfully held [15], [16], [17], [18], [19]. These challenges not only clarified the development goals of ME tasks but also significantly accelerated the development of MER.
Observation of MEs shows that, although the ME process is brief, it still contains redundant information [20], [21]. In addition to redundant frames, there is redundant facial information, because the facial muscle movements are slight [1], [2]. To reduce the impact of this redundancy, researchers have used LBP-TOP and optical flow to reduce the dimensionality of the data and remove redundancy. These handcrafted features play a useful role, but they also dilute and distort some features of the ME sequence. To address this difficulty, researchers have fed handcrafted features and raw ME frames into neural network models simultaneously, which improves accuracy but increases the algorithm's time complexity. To further reduce redundant information while decreasing time complexity, we propose STCP-Net. STCP-Net uses a differential approach to remove the static facial information, which extracts the dynamic information between frames more effectively, and then uses the STCP module to extract the spatial information within frames and the temporal information between frames. In addition, to reduce the influence of head shaking on the differential features, we design a head-debouncing network (Debouncer) that micro-aligns the face frames.
In summary, our main contributions are as follows:
1) We propose the STCP module for the first time. Exploiting the unique properties of the capsule network, we extend the ordinary capsule network into a multi-layer capsule network that can learn spatio-temporal information. Our experiments show that this module is more effective than the ordinary capsule network.
2) We design a reliable frame-alignment network for STCP-Net that aligns the same features across different frames and improves the model's overall performance.
3) We design an end-to-end model that operates on ME sequences, conduct experiments on standard datasets, and compare the results with current state-of-the-art methods. The results show that our method effectively removes redundant information from MEs, improves recognition speed, and outperforms existing state-of-the-art methods.

II. RELATED WORK
A. EXISTING RECOGNITION METHODS
This section reviews existing MER methods. Based on their feature extraction approach, MER methods can be classified into three categories: handcrafted, handcrafted-plus-deep-learning, and deep-learning-based. Handcrafted features fall mainly into two families: LBP-TOP and optical flow. LBP-TOP is a high-dimensional feature extraction algorithm proposed by Zhao et al. as an improvement on the local binary patterns (LBP) algorithm [6]; it considers and records dynamic texture information in the X-Y, X-T, and Y-T planes. Pfister et al. implemented the first spontaneous facial MER system using the LBP-TOP algorithm [7]. Wang et al. proposed LBP with six intersection points (LBP-SIP) to reduce the redundant information in LBP [22]. Guo et al. combined LBP-TOP with nearest-neighbor classification to obtain a reasonable solution [23]. Wang et al. further explored combining LBP with color spaces [24], [25]. Furthermore, Huang et al. proposed combining integral projection with one-dimensional LBP to obtain the spatio-temporal local binary pattern with integral projection (STLBP-IP) algorithm [26]. The optical flow method is an effective way to extract the motion state of objects in space; in MER, it can recognize the motion direction of facial muscles [27], [28]. Happy combined the optical flow method with a fuzzy histogram and obtained good results [29]. Liong et al. used the onset and apex frames to calculate optical flow and proposed bi-weighted oriented optical flow (Bi-WOOF) [30].
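To make the three-plane decomposition concrete, the following is a minimal sketch of an LBP-TOP descriptor, assuming a grayscale (T, H, W) volume and using scikit-image's 2-D LBP on every plane; it illustrates the idea from [6], not the original implementation:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(volume, P=8, R=1):
    """LBP-TOP sketch: uniform LBP histograms on the XY, XT, and YT planes
    of a (T, H, W) uint8 grayscale volume, concatenated into one descriptor.
    Scanning every plane, as done here, is what makes the method expensive."""
    n_bins = P + 2                                     # 'uniform' LBP code range

    def plane_hist(planes):
        h = np.zeros(n_bins)
        for p in planes:
            codes = local_binary_pattern(p, P, R, method="uniform")
            h += np.bincount(codes.astype(int).ravel(), minlength=n_bins)
        return h / h.sum()

    T, H, W = volume.shape
    xy = plane_hist(volume[t] for t in range(T))        # spatial texture
    xt = plane_hist(volume[:, y, :] for y in range(H))  # horizontal motion texture
    yt = plane_hist(volume[:, :, x] for x in range(W))  # vertical motion texture
    return np.concatenate([xy, xt, yt])
```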
Handcrafted-plus-deep-learning methods use both a neural network and a handcrafted operator as feature extractors. Su et al. extracted first-order motion, second-order motion, and optical flow from the onset and apex frames, then used a shallow residual network for further feature extraction [31]. Wang et al. used a convolutional network to extract features from the optical flow sequence and the original images, then performed late fusion [32]. Zhao et al. obtained more distinct optical flow features after amplifying the motion information of the frames with a motion magnification operation, and then extracted features with a convolutional neural network [33]. Moreover, Song et al. used a three-stream convolutional network to extract features from optical flow features, apex frames, and cropped apex frames separately [34].
Deep-learning-based methods can be divided into three categories depending on the input data. The first uses only the apex frame as input [35] and extracts the facial texture features exhibited by the ME. The second uses the onset and apex frames as input [20] and extracts facial texture together with optical flow. The last takes a frame sequence as input to extract the geometric change features of the ME [36], [37], [38], [39]. Compared with single-frame and double-frame inputs, multi-frame MER methods can extract more of the ME's spatio-temporal information and show better recognition performance.

B. DE-REDUNDANT INFORMATION ANALYSIS
To extract the spatio-temporal information in MEs effectively, scholars have studied de-redundancy from multiple perspectives. LBP-TOP extracts dynamic textures from frame sequences, but its time complexity is excessive. To address this, researchers proposed methods such as LBP-SIP and a GPU version of LBP-TOP to reduce the time complexity; however, the processing time is still higher than that of deep learning methods, and Table 10 provides strong evidence for this conclusion. The optical flow method extracts the motion direction of an object by calculating the movement of pixels between two frames and is used in motion detection, motion-compensated coding, and other areas. However, for facial MEs this method loses some essential details, so researchers add analysis of the original frames alongside the optical flow, and the recognition performance improves substantially; the relevant conclusions can be seen in the experimental comparison in Section IV. It is worth mentioning that Li et al. proposed extracting the valid ME frames in a video by computing regions of interest (ROIs) through a dedicated local temporal pattern (LTP), which reduces the redundant information in the video [40].
Compared with handcrafted operators, pure deep learning algorithms have seen little research on de-redundancy. Lei et al. [20] used MagNet to extract magnified shape features from the onset and apex frames, analyzed the locations of facial key points with a graph learning model, and fused facial action units (AUs) for classification; compared with the optical flow method, this approach preserves the detailed facial features to a greater extent. Bai et al. used a VGGFace2 network to extract facial features from each frame in the sequence and then used an LSTM to extract the temporal information of the feature sequence [36]. Inspired by these two papers, we propose a new pipeline for MER: first extract the features of each frame using the Debouncer, then extract the motion information at the same position across frames, and finally extract the temporal features and classify them using an effective spatio-temporal feature extraction network.

III. STCP-NET
First, we describe the overall architecture of STCP-Net; then we introduce the individual modules in detail. Our method has two main parts: the extraction of diversified differential features, and the extraction of spatio-temporal features. As essential preparation for the second part, the first part must ensure that valid facial dynamic information is extracted, so we describe the structure of the Debouncer and the differential feature extraction module first. In the second part, we describe in detail the structure of our spatio-temporal feature extraction method, the STCP module, and how it works.
As shown in Fig. 1, we split each frame of the input ME sequence into 3 × 3 blocks, obtain alignment features through the Debouncer, and apply a channel attention mechanism to the alignment features. We then use the differential feature extraction module to extract and fuse the diversified differential features. Finally, we use the STCP module to analyze the differential features and classify them.
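The skeleton below sketches this data flow under our own assumptions; the Debouncer, channel attention, and STCP modules are stand-ins (identity layers by default), since the paper specifies them only at the architectural level:

```python
import torch
import torch.nn as nn

class STCPNetSketch(nn.Module):
    """High-level flow of STCP-Net (Fig. 1). Only the data flow described in
    the paper is reproduced; the three sub-modules are placeholders."""
    def __init__(self, debouncer=None, attn=None, stcp=None, n_classes=5):
        super().__init__()
        self.debouncer = debouncer or nn.Identity()
        self.attn = attn or nn.Identity()
        self.stcp = stcp or nn.Identity()
        self.fc = nn.LazyLinear(n_classes)              # final fully connected layer

    def forward(self, blocks):                          # (B, s, k, C, 74, 74)
        f = self.attn(self.debouncer(blocks))           # alignment features
        f_onset = f[:, :1]                              # onset-frame features
        f_d = torch.relu(f - f_onset)                   # differential features
        f_anti = torch.relu(f_onset - f)                # inverse differential features
        z = self.stcp(torch.cat([f_d, f_anti], dim=3))  # spatio-temporal capsules
        return self.fc(z.flatten(1))

logits = STCPNetSketch()(torch.randn(2, 10, 9, 1, 74, 74))
print(logits.shape)                                     # torch.Size([2, 5])
```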

A. EXTRACTION OF DIVERSIFIED DIFFERENTIAL FEATURES
During a facial ME, only the moving areas of the face carry dynamic information; the stationary areas carry only static information.
The LBP-TOP algorithm, the optical flow method, and the facial dynamic map (FDM) [28] algorithm all discard the static information of the face and focus on the dynamic information. Many deep learning algorithms incorporate optical flow to extract dynamic information [27], [29], [41], and algorithms that take multiple frames as input try to extract and distinguish this dynamic information more effectively. It has been shown in [42] and [43] that multi-frame input yields higher accuracy than the apex frame alone. Inspired by this, we discard the static information in the ME sequences and use deep learning together with the difference method (DM) to extract the dynamic information.
We take the ME sequence as input to extract dynamic information; the onset frame is the first frame of the sequence, and the apex frame is the last. Song et al. [34] demonstrated that block-based facial segmentation can improve feature extraction, so we split each 224 × 224 frame into 3 × 3 blocks of size 74 × 74 before extracting features with the Debouncer. To prevent errors caused by discarding boundary information, we zero-pad the blocks several times inside the network, which also reduces the impact of head jitter. Table 1 shows the specific network structure.
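As an illustration of the block splitting, the sketch below tiles a 224 × 224 frame into 3 × 3 blocks of 74 × 74 using a 75-pixel stride (window positions 0, 75, and 150 cover 224 pixels exactly); the stride choice is our assumption, since the paper states only the block count and size:

```python
import torch

def split_into_blocks(frames):
    """Tile 224x224 frames into a 3x3 grid of 74x74 blocks.
    frames: (B, C, 224, 224) -> (B, 3, 3, C, 74, 74)."""
    # unfold(dim, size=74, step=75) yields floor((224-74)/75)+1 = 3 windows per axis.
    blocks = frames.unfold(2, 74, 75).unfold(3, 74, 75)   # (B, C, 3, 3, 74, 74)
    return blocks.permute(0, 2, 3, 1, 4, 5).contiguous()  # block grid first

x = torch.randn(2, 3, 224, 224)
print(split_into_blocks(x).shape)   # torch.Size([2, 3, 3, 3, 74, 74])
```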
Finally, we perform a difference operation between each frame's alignment features and the onset frame's alignment features to extract the dynamic information. However, the change between two frames is a complex texture change. To distinguish increases of texture from decreases, we apply the ReLU operation to the difference, obtaining the differential feature F_d and the inverse differential feature F_antid:

F_d^i = ReLU(F_i − F_onset),
F_antid^i = ReLU(F_onset − F_i),

where ReLU is defined as

ReLU(x) = max(0, x),

F_i is the feature of the i-th frame extracted by the Debouncer, and F_onset is the feature of the onset frame extracted by the Debouncer. Fig. 2 shows the effect of the diversified differential features. After extracting the alignment features, it is difficult to observe differences between the alignment features of each frame with the naked eye. However, after extracting F_d and F_antid, we can observe that the face is always in motion: group (c) shows the dynamic information of the mouth and eye regions, and group (d) captures the motion of the glabella and the left chin. Fig. 2 shows that the diversified difference method effectively removes redundant information and improves the expressiveness of the model, which also demonstrates the importance of the inverse differential features.
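A toy example of the two ReLU branches, with made-up feature values, shows how texture increases and decreases end up in separate features:

```python
import torch

# Minimal illustration of the diversified differential features:
# ReLU keeps texture increases in F_d and texture decreases in F_antid.
f_onset = torch.tensor([0.2, 0.5, 0.9])   # toy alignment feature, onset frame
f_i     = torch.tensor([0.6, 0.5, 0.1])   # toy alignment feature, frame i

f_d     = torch.relu(f_i - f_onset)       # tensor([0.4, 0.0, 0.0]) -> increases
f_antid = torch.relu(f_onset - f_i)       # tensor([0.0, 0.0, 0.8]) -> decreases
```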

B. EXTRACTION OF SPATIO-TEMPORAL FEATURES
To extract effective spatio-temporal features from the dynamic information, we design a multi-layer capsule structure for MER based on the capsule network.
The capsule network was proposed to address the lack of spatial information in CNNs. It uses vector groups to represent the instantiation parameters of features; the coupling weights between low-level and higher-level capsules are obtained by dynamic routing, which rate-codes the probability that objects are present in space and stores the result as vectors in the top-level capsules. Mazzia et al. [44] abandoned the original routing mechanism and proposed a capsule network based on a self-attention routing mechanism, allowing the capsule network to perform well with only two percent of the original parameters.
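For intuition, the following is a simplified self-attention routing layer in the spirit of Efficient-CapsNet [44]; the agreement computation and the weight initialization are our simplifications, not the exact published implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Squash nonlinearity: scales vector length into (0, 1), keeps direction.
    n = torch.norm(s, dim=dim, keepdim=True)
    return (n ** 2 / (1 + n ** 2)) * (s / (n + eps))

class SelfAttentionRouting(nn.Module):
    """Sketch of routing between capsule layers via self-attention agreement."""
    def __init__(self, n_in, d_in, n_out, d_out):
        super().__init__()
        # One transformation matrix per (input capsule, output capsule) pair.
        self.W = nn.Parameter(0.01 * torch.randn(n_in, n_out, d_in, d_out))

    def forward(self, u):                                    # u: (B, n_in, d_in)
        # Prediction vectors: u_hat[b, i, j] = u[b, i] @ W[i, j]
        u_hat = torch.einsum('bid,ijde->bije', u, self.W)    # (B, n_in, n_out, d_out)
        # Scaled agreement between the predictions of different input capsules.
        a = torch.einsum('bije,bkje->bijk', u_hat, u_hat) / u_hat.shape[-1] ** 0.5
        c = F.softmax(a.sum(dim=-1, keepdim=True), dim=1)    # coupling coefficients
        s = (c * u_hat).sum(dim=1)                           # weighted sum -> (B, n_out, d_out)
        return squash(s)
```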
This module builds on the capsule network's ability to extract spatial information: capsules are first extracted from block features and then progressively extended to spatial and temporal features. Fig. 3 shows the complete STCP module architecture. First, we extract the low-level capsules from the block features using three convolutional layers. Then, higher-level capsules based on the block features are obtained by routing with the Efficient-CapsNet algorithm proposed by Mazzia et al. [44]. At this point, we have extracted multiple higher-level capsules for each frame-block, and these capsules store various entity attributes. The data can be represented as a matrix of size s × k × n_0 × d_0, where s is the number of frames in the ME sequence, k is the number of blocks per frame, n_0 is the number of higher-level capsules per block, and d_0 is the dimension of a single capsule.
We fuse all the higher-level capsules of a single block into one capsule, so that a block capsule (BC) contains all the properties of that block. By routing the k BCs in a frame, we obtain multiple higher-level capsules containing the spatial information of a single frame; the data can now be represented as a matrix of size s × n_1 × d_1, where n_1 is the number of higher-level capsules per frame and d_1 is the dimension of a single capsule. Following the same procedure, we fuse all the higher-level capsules in a single frame into one capsule to obtain s space capsules (SCs), each containing all the spatial information of one frame. Finally, the SCs are routed to obtain capsules containing spatio-temporal information (STCs); the data can be expressed as n_2 × d_2, where n_2 and d_2 are the number and dimension of the STCs.
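The shape walkthrough below traces these fusion and routing steps with assumed sizes (s = 10, k = 9, n_0 = 16, d_0 = 8); fusing capsules by concatenation and the route() placeholder are our assumptions for illustration:

```python
import torch

def route(caps, n_out, d_out):
    # Placeholder: any routing layer mapping (B, n_in, d_in) -> (B, n_out, d_out),
    # e.g. the SelfAttentionRouting sketch above.
    return torch.randn(caps.shape[0], n_out, d_out)

s, k, n0, d0 = 10, 9, 16, 8
caps = torch.randn(1, s, k, n0, d0)            # higher-level capsules per frame-block

# Fuse each block's n0 capsules into one block capsule (BC) by concatenation.
bc = caps.reshape(s, k, n0 * d0)               # (s, k, d_bc): k BCs per frame
frame_caps = route(bc, n_out=8, d_out=16)      # (s, n1, d1): spatial capsules per frame

# Fuse each frame's capsules into one space capsule (SC), then route over time.
sc = frame_caps.flatten(1).unsqueeze(0)        # (1, s, n1*d1): one SC per frame
stc = route(sc, n_out=5, d_out=16)             # (1, n2, d2): spatio-temporal capsules
print(stc.shape)                               # torch.Size([1, 5, 16])
```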

IV. EXPERIMENT
This section details the datasets we use, the implementation details, and a comparison of experimental results.

Fig. 2. Example of differential features. Group (a) is the original frames; group (b) is the alignment features; group (c) is the differential features; group (d) is the inverse differential features. For groups (a)-(b), each column is the onset frame and its alignment feature; for groups (c)-(d), each column is the differential feature obtained by differencing the alignment feature with that of the first column.

A. DATASET
We conduct extensive MER experiments on the CASME II and SMIC datasets to evaluate STCP-Net. We use the LOSO cross-validation protocol in all experiments: in each fold, the validation set consists of one subject and the training set consists of the remaining subjects, until every subject has served as the validation set. The final UAR, F1 score, and accuracy are calculated from the sum of the confusion matrices produced by the independent validation folds. The LOSO protocol reduces the error caused by overfitting, ensures the reliability of the experimental results, and is the cross-validation protocol commonly used for MER.
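The sketch below shows this protocol with scikit-learn's LeaveOneGroupOut, where train_and_predict is a placeholder for training STCP-Net on one fold and predicting the held-out subject:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import confusion_matrix

def loso_evaluate(X, y, subjects, train_and_predict, n_classes):
    """LOSO protocol: each subject serves once as the validation set; the
    per-fold confusion matrices are summed before computing final metrics."""
    total_cm = np.zeros((n_classes, n_classes), dtype=int)
    for train_idx, val_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[val_idx])
        total_cm += confusion_matrix(y[val_idx], y_pred,
                                     labels=np.arange(n_classes))
    return total_cm
```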

1) CASME II
The CASME II dataset was collected by Yan et al. [10] at the Institute of Psychology, Chinese Academy of Sciences. CASME II contains 255 samples collected in the laboratory by 26 subjects. Table 2 shows the details of the data used in this paper.

2) SMIC
The SMIC dataset was collected by Li et al. [13] at the University of Oulu. SMIC contains 164 samples collected in the laboratory from 16 subjects. Table 3 shows the details of the data used in this paper.

B. EXPERIMENTAL CONFIGURATION
The experiments use Adam as the optimizer with a learning rate of 1e-4, which decays as the number of iterations increases. Cross-entropy is chosen as the loss function:

H(p, q) = − Σ_i p(x_i) log q(x_i),

where the probability distribution p is the expected output and the probability distribution q is the actual output. Table 4 shows additional configurations.
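A minimal configuration sketch is shown below; the exponential decay schedule is assumed purely for illustration, since the paper states only that the learning rate decays with iterations:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 5)                     # stand-in for STCP-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Assumed schedule: the paper does not specify the exact decay rule.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)
criterion = nn.CrossEntropyLoss()             # H(p, q) = -sum_i p(x_i) log q(x_i)
```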

C. EXPERIMENTAL RESULTS
This subsection presents the three-category, four-category, and five-category experiments and the ablation experiments on the CASME II and SMIC datasets; Fig. 4 shows the main confusion matrices. We compare STCP-Net with other state-of-the-art methods, all of which use the LOSO protocol. Tables 5, 6, 7, 8, 9, and 10 compare UAR, accuracy, and F1 scores, computed with the following equations.
Recall_i = M_ii / Σ_j M_ij,   Precision_i = M_ii / Σ_j M_ji,
UAR = (1/K) Σ_{i=1}^{K} Recall_i,
Accuracy = (Σ_{i=1}^{K} M_ii) / (Σ_i Σ_j M_ij),
F1 = (1/K) Σ_{i=1}^{K} (2 · Precision_i · Recall_i) / (Precision_i + Recall_i),

where K is the total number of categories, M is the confusion matrix (M_ij being the number of class-i samples predicted as class j), Recall_i is the recall of category i, and Precision_i is the precision of category i.
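The sketch below computes these metrics from the summed confusion matrix, assuming the row convention M[i][j] = number of samples of true class i predicted as class j:

```python
import numpy as np

def metrics_from_confusion(M):
    """UAR, accuracy, and macro F1 from a summed confusion matrix M."""
    M = np.asarray(M, dtype=float)
    tp = np.diag(M)
    recall = tp / M.sum(axis=1)                 # per-class recall
    precision = tp / M.sum(axis=0)              # per-class precision
    f1 = 2 * precision * recall / (precision + recall)
    # Note: a class never predicted yields NaN here; handle as needed.
    return recall.mean(), tp.sum() / M.sum(), f1.mean()   # UAR, Acc, macro F1
```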

1) THREE-CATEGORY EXPERIMENT ON SMIC DATASET
As shown in Table 5, we detail the performance of STCP-Net on the three-category task and compare it with other methods using the LOSO protocol. STCP-Net achieves an accuracy of 77.27%, a UAR of 0.7716, and an F1 score of 0.7747 on SMIC-HS, and an accuracy of 84.51%, a UAR of 0.8462, and an F1 score of 0.8429 on SMIC-VIS. Compared with the state-of-the-art methods TSNN-LF [45], KFC-MER [31], and AU-GACN [37], our F1 improves by 0.0826, 0.1109, and 0.0555, respectively, showing that STCP-Net improves recognition accuracy in the three-category setting. We notice that STCP-Net performs better on SMIC-VIS than on SMIC-HS; after comparison and analysis, we attribute this to the color range of the SMIC-VIS data and the extraction of critical dynamic frames. This also demonstrates the advantage of STCP-Net in handling small datasets.
2) THREE-CATEGORY EXPERIMENT ON CASME II DATASET
As shown in Table 6, we detail the performance of STCP-Net on the three-category task and compare it with other methods using the LOSO protocol. STCP-Net achieves an accuracy of 91.46%, a UAR of 0.9016, and an F1 score of 0.8977. Compared with the state-of-the-art methods EMR [46], STST-Net [47], and Dual-Inception [48], our F1 improves by 0.0684, 0.0595, and 0.0356, respectively. The comparison shows that our STCP-Net method is highly competitive in the three-category setting.

3) FOUR-CATEGORY EXPERIMENT ON CASME II DATASET
As shown in Table 7, our STCP-Net achieves an accuracy of 76.04% and an F1 of 0.7601. Compared with the state-of-the-art methods Bi-WOOF [30], DSTAN [32], and FR [49], our accuracy improves by 17.14%, 0.84%, and 7.66%, respectively. The accuracy is lower than that of STRCN [39], but the F1 is higher by 0.0131. The comparison shows that our STCP-Net method is highly competitive in the four-category setting.

5) ABLATION EXPERIMENT
To verify the effectiveness of the differential features and of STCP-Net, we perform an extensive ablation analysis on the CASME II dataset, presented in Table 9. We compare the full algorithm with the following variants: no extraction of differential features after the Debouncer (Without-Difference); differential features after the Debouncer but no ReLU operation (Without-Relu); differential features with the ReLU operation but without extracting F_antid (Without-Anti); and, after extracting the BCs, performing the temporal extraction first, followed by the spatial extraction (TSCP-Net).

6) COMPARISON OF TIME
We also compare the running speed of the algorithms, displayed in Table 10. Our method takes 0.38 seconds per sample on the CPU and 0.015 seconds per sample on the GPU; Section IV-B describes the running environment. Compared with the handcrafted methods, the deep-learning-based MER methods significantly improve both running speed and accuracy. Compared with the state-of-the-art deep-learning-based algorithms Residual Network and MA, the running time of STCP-Net is reduced by 0.57 s and 0.72 s, respectively.
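For reference, per-sample latencies of this kind can be measured roughly as follows; this is a generic timing sketch, not the paper's measurement code:

```python
import time
import torch

@torch.no_grad()
def per_sample_latency(model, sample, n_runs=100):
    """Average per-sample inference time over n_runs forward passes."""
    model.eval()
    model(sample)                      # warm-up run
    if sample.is_cuda:
        torch.cuda.synchronize()       # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(n_runs):
        model(sample)
    if sample.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs
```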

V. CONCLUSION
This paper proposes a spatio-temporal capsule network (STCP-Net) for MER. We conduct extensive experiments and comparisons on publicly available spontaneous datasets to evaluate the proposed method. The experimental results show that our method improves recognition accuracy and dramatically increases recognition speed.
In addition, this paper explores a problem that has received little attention in current research. A significant feature distinguishing MEs from macro-expressions is that the action amplitude is very subtle, which makes most macro-expression recognition algorithms unsuitable for MEs. Because the movement amplitude in MEs is much smaller than in macro-expressions, most of the facial information in ME sequences is redundant, and too much redundant information degrades algorithm performance. Thanks to the introduction of differential features, our method dramatically reduces the redundant information in ME sequences, ensures the effectiveness and simplicity of the features, and keeps the number of model parameters at 66 million.
Although this method is faster than the optical flow method, it places higher requirements on the input data: the head in the input ME sequence must not jitter significantly. In other words, the method achieves its best performance when head jitter is small or when the heads in different frames are perfectly aligned. Reducing head jitter more effectively is one of the directions we plan to study in the future.
In summary, although the algorithm in this paper achieves good results, there is still room for improvement: for example, further optimizing the Debouncer to correct head jitter over a wider range, and optimizing the differential feature extraction method to improve the diversity of the features.