Facial Micro-Expression Recognition Enhanced by Score Fusion and a Hybrid Model from Convolutional LSTM and Vision Transformer

In the billions of faces that are shaped by thousands of different cultures and ethnicities, one thing remains universal: the way emotions are expressed. To take the next step in human–machine interactions, a machine (e.g., a humanoid robot) must be able to interpret facial emotions. Allowing systems to recognize micro-expressions affords the machine deeper insight into a person's true feelings, so that it can take human emotion into account while making optimal decisions. For instance, these machines will be able to detect dangerous situations, alert caregivers to challenges, and provide appropriate responses. Micro-expressions are involuntary and transient facial expressions capable of revealing genuine emotions. We propose a new hybrid neural network (NN) model capable of micro-expression recognition in real-time applications. Several NN models are first compared in this study. Then, a hybrid NN model is created by combining a convolutional neural network (CNN), a recurrent neural network (RNN, e.g., long short-term memory (LSTM)), and a vision transformer. The CNN can extract spatial features (within a neighborhood of an image), whereas the LSTM can summarize temporal features. In addition, a transformer with an attention mechanism can capture sparse spatial relations residing in an image or between frames in a video clip. The inputs of the model are short facial videos, while the outputs are the micro-expressions recognized from the videos. The NN models are trained and tested with publicly available facial micro-expression datasets to recognize different micro-expressions (e.g., happiness, fear, anger, surprise, disgust, sadness). Score fusion and improvement metrics are also presented in our experiments. The results of our proposed models are compared with those of literature-reported methods tested on the same datasets. The proposed hybrid model performs the best, where score fusion can dramatically increase recognition performance.


Introduction
Facial expressions serve as a universally understood form of human communication intimately linked to one's mental states, attitudes, and intentions. In addition to the typical facial expressions displayed in daily life, known as macro-expressions, there exists a distinct category called micro-expressions. These micro-expressions emerge in specific conditions, unveiling people's concealed emotions during high-stakes situations when they strive to mask their true feelings [1,2]. Unlike macro-expressions, micro-expressions are involuntary, spontaneous, subtle, and rapid, lasting typically between 40 and 450 ms. They are instinctive facial movements that react to emotional stimuli [3,4]. While individuals can consciously conceal or restrain their genuine emotions through macro-expressions, micro-expressions are beyond their control, inevitably exposing the authentic emotions they experience [5][6][7][8].
Recognizing micro-expressions is a daunting task as they are fleeting, involuntary, and exhibit low intensity. Only extensively trained experts possess the ability to discern them reliably.

Datasets

There are two datasets used in this study, which are briefly described in this sub-section. The MMEW dataset [15] follows the same elicitation paradigm used in other published datasets [19][20][21][22][23], i.e., watching emotional video episodes while attempting to maintain a neutral expression. The full details of the MMEW dataset construction process are presented in Reference [15]. All samples in MMEW were carefully calibrated by experts with onset, apex, and offset frames, and the action unit (AU) [15] annotation from the Facial Action Coding System (FACS) [24] was used to describe the facial muscle movement area. MMEW contains 300 micro-expression samples (image sequences). The samples in MMEW have a large image resolution (1920 × 1080 pixels). Furthermore, MMEW has a facial image size of 400 × 400 pixels. MMEW has seven elaborate emotion classes (see Figure 1), i.e., Happiness (36), Anger (8), Surprise (89), Disgust (72), Fear (16), Sadness (13), and Others (66).
The Chinese Academy of Sciences Micro-expression CASME II [22] dataset was developed in a well-controlled laboratory environment, where four lamps were chosen to provide steady and high-intensity illumination. To elicit micro-expressions, participants were instructed to maintain a neutral facial expression when watching video episodes with high emotional valence. CASME II used a high-speed camera with a sampling rate of 200 fps. There are 247 image sequences, which consist of facial images of 280 × 340 pixels. CASME II contains the samples of five emotion classes (see Figure 1), i.e., Happiness (33 samples), Repression (27), Surprise (25), Disgust (60), and Others (102).

Facial Image Standardization and Normalization
All facial images have been extracted from both datasets (as shown in Figure 1). Face detection is unnecessary in this study, as each dataset contains one face per image.
General standardization is applied to all facial images, which is defined as follows.
I_S = (I_D − μ) / σ, with I_D = I_0 − I_M, (1)

where I_0 is the original facial image, I_M is the mean image of all faces in the dataset, and I_D is their difference image. I_S is the standardized image; μ and σ denote the mean and standard deviation of the I_D image, respectively. Image normalization (intensity scaling) is required by neural network models, which is defined as

I_N = (I_0 − I_Min) / (I_Max − I_Min) × (L_Max − L_Min) + L_Min, (2)

where I_N is the normalized image, I_0 is the original (input) image; I_Min and I_Max are the minimum and maximum pixel values in I_0, while L_Min and L_Max are the desired minimum and maximum pixel values in I_N. For example, we may select L_Min = 0 and L_Max = 1.0.
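As a minimal sketch, the two preprocessing steps above can be written in NumPy; the function names (standardize_face, normalize_intensity) are ours, and 8-bit grayscale inputs are assumed for illustration.

```python
import numpy as np

def standardize_face(I0, I_mean):
    """Eq. (1): subtract the dataset mean face, then z-score the difference."""
    I_D = I0.astype(np.float64) - I_mean          # difference image I_D = I_0 - I_M
    return (I_D - I_D.mean()) / I_D.std()         # I_S = (I_D - mu) / sigma

def normalize_intensity(I0, L_min=0.0, L_max=1.0):
    """Eq. (2): linearly rescale pixel values into [L_min, L_max]."""
    I0 = I0.astype(np.float64)
    return (I0 - I0.min()) / (I0.max() - I0.min()) * (L_max - L_min) + L_min

faces = np.random.default_rng(0).integers(0, 256, size=(10, 64, 64))
mean_face = faces.mean(axis=0)
I_S = standardize_face(faces[0], mean_face)       # zero mean, unit variance
I_N = normalize_intensity(faces[0])               # values in [0, 1]
```

The standardized image has zero mean and unit variance by construction, while the normalized image lands in the chosen [L_Min, L_Max] range.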

Convolutional Neural Networks
Convolutional neural networks (CNNs) draw inspiration from the biological functioning of the visual cortex, where small groups of cells exhibit sensitivity to specific regions within the visual field. CNNs blend principles from biology, mathematics, and computer science, making them pivotal innovations in the domains of computer vision and artificial intelligence (AI). The year 2012 marked a significant turning point for CNNs when Krizhevsky et al. [25] utilized an eight-layer CNN (comprising five convolutional layers and three fully-connected layers) to secure victory in the ImageNet competition. This groundbreaking achievement, known as AlexNet, reduced the classification error rate from 25.8% in 2011 to an impressive 16.4% in 2012, signifying a remarkable improvement at that time. Since then, deep learning CNNs have spurred the development of numerous applications in various domains.
During the training of AlexNet, batch stochastic gradient descent (SGD) was employed, incorporating carefully selected momentum and weight decay values. This groundbreaking model achieved exceptional performance on the challenging ImageNet dataset, setting a new record in the competition and solidifying the superior capabilities of CNNs. Since then, many well-known CNNs have been developed, such as VGG-19, ResNet-50, Inception-V3, and DenseNet-201. Our review begins with the ResNet and Xception models, followed by a discussion of vision transformers, and finishes with RNNs.
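The SGD-with-momentum-and-weight-decay update mentioned above can be sketched in a few lines; the hyperparameter values below are illustrative, not AlexNet's actual settings.

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """v <- momentum*v - lr*(grad + weight_decay*w);  w <- w + v"""
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

# Toy run on a quadratic loss L(w) = 0.5*||w||^2, whose gradient is w:
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = sgd_step(w, grad=w, velocity=v)
# w shrinks toward the minimum at 0
```

Momentum accumulates a running direction across mini-batches, while the weight-decay term shrinks the weights toward zero, acting as L2 regularization.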

ResNet and Xception Models
The ResNet-50 model [26] is composed of 50 layers, featuring 16 residual blocks with three layers each, in addition to input and output layers. These residual blocks introduce identity connections that facilitate incremental or residual learning, enabling effective backpropagation. Through this approach, the identity layers progressively evolve from simple to complex representations. This evolution is particularly beneficial when the parameters of a CNN block start at or near zero. The inclusion of residual blocks helps address the challenging issue of vanishing gradients encountered in training deep neural networks with more than 30 layers.
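A toy illustration of the identity connection y = x + F(x) described above: when the residual branch's weights start at zero, the block is exactly the identity map, so learning can proceed from a simple (identity) to a more complex representation. The layer sizes here are arbitrary.

```python
import numpy as np

def residual_block(x, W1, W2):
    """Two linear layers with ReLU, wrapped by an identity (skip) connection."""
    h = np.maximum(0.0, x @ W1)   # first layer + ReLU
    return x + h @ W2             # skip connection adds the input back

x = np.array([[1.0, 2.0, 3.0]])
W1 = np.zeros((3, 3))
W2 = np.zeros((3, 3))
y = residual_block(x, W1, W2)     # with zero weights, y equals x exactly
```

Because the skip path carries the input through unchanged, gradients can flow around the residual branch, which is what mitigates the vanishing-gradient problem in very deep networks.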
Recently, the Xception [27] (Extreme Inception) network architecture was proposed based on the following hypothesis: the mapping of cross-channel correlations and spatial correlations in the feature maps of CNNs can be entirely decoupled. Thus, the Inception modules can be replaced with depthwise separable convolutions. The feature extraction base of the Xception architecture is constructed with 36 convolutional layers. For image classification, a logistic regression layer follows the convolutional base. Optionally, fully-connected layers can be added before the logistic regression layer. These 36 convolutional layers are organized into 14 modules, with linear residual connections encompassing each module, except for the first and last ones. When compared to Inception V3, Xception exhibits a comparable parameter count while demonstrating slight enhancements in classification performance on the ImageNet dataset.
In the Xception model, a depthwise separable convolution, also known as a separable convolution in deep learning frameworks, such as TensorFlow/Keras, is employed. This approach involves two steps: first, a depthwise convolution is performed independently on each channel of the input, followed by a pointwise convolution, which is a 1 × 1 convolution. The pointwise convolution projects the output channels from the depthwise convolution into a new channel space. The scenario of separable convolution plus pointwise convolution can significantly reduce the load of convolutional computation in contrast with a regular two-dimensional (2D) or three-dimensional (3D) convolutional layer; thus, it speeds up the CNN model training and the inference process.
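The computational saving can be seen from rough multiply-accumulate (MAC) counts for a regular convolution versus a depthwise separable convolution (depthwise plus 1 × 1 pointwise); the layer sizes below are illustrative, not Xception's actual ones.

```python
def regular_conv_macs(h, w, c_in, c_out, k=3):
    """MACs for a standard k x k convolution over all channel pairs."""
    return h * w * c_in * c_out * k * k

def separable_conv_macs(h, w, c_in, c_out, k=3):
    depthwise = h * w * c_in * k * k      # one k x k filter per input channel
    pointwise = h * w * c_in * c_out      # 1x1 conv mixes channels
    return depthwise + pointwise

h = w = 56
c_in, c_out = 128, 256
ratio = regular_conv_macs(h, w, c_in, c_out) / separable_conv_macs(h, w, c_in, c_out)
# ratio approaches k*k = 9 for large channel counts
```

For this configuration the separable version needs roughly 8.7 times fewer MACs, which is the source of the training and inference speedup noted above.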

Vision Transformers
Initially, transformers were developed and applied primarily to tasks in natural language processing (NLP), as evidenced by language models, such as BERT (Bidirectional Encoder Representations from Transformers) [28]. Transformers ascertain the connections between pairs of input tokens, such as words in NLP, through a mechanism called attention. However, this approach becomes increasingly computationally expensive with a growing number of tokens. When dealing with images, the fundamental unit of analysis becomes the pixel. Nevertheless, computing relationships between every pair of pixels becomes prohibitively costly in terms of memory and computation. To address this, Vision Transformers (ViTs) calculate relationships among smaller image regions, typically 16 × 16 pixels, resulting in reduced computational requirements. These regions, accompanied by positional embeddings, are organized into a sequence. The embeddings represent learnable vectors. Each region is vectorized and multiplied by an embedding matrix. The resulting sequence, along with the positional embeddings, is then fed into the transformer for further processing. The Video ViT (ViViT) model has one additional process, called a video tube (cube, i.e., frames by height by width, e.g., 4 × 16 × 16) positional embedding, while the rest of the ViViT process is the same as ViT.
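The patch-embedding step described above can be sketched as follows: split a 224 × 224 image into 16 × 16 patches, flatten each patch, and project it with an embedding matrix, adding positional embeddings. The random matrices here stand in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
P, D = 16, 96                                   # patch size, embedding dimension

# (224/16)^2 = 196 patches, each flattened to 16*16*3 = 768 values
patches = image.reshape(224 // P, P, 224 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)

E = rng.standard_normal((P * P * 3, D)) * 0.02           # embedding matrix
pos = rng.standard_normal((patches.shape[0], D)) * 0.02  # positional embeddings
tokens = patches @ E + pos                # token sequence fed to the transformer
```

A ViViT-style video tube embedding works the same way, except each token comes from a small 3D cube of frames (e.g., 4 × 16 × 16) rather than a 2D patch.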
Self-attention is commonly applied to the vision transformer model. The calculation of self-attention is to create three vectors from each of the encoder's input vectors (in the NLP case, the embedding of each word). So for each word, a Query vector, a Key vector, and a Value vector are created by multiplying the embedding by three matrices that were trained during the training process. Multi-headed attention expands the model's ability to focus on different positions. It gives the attention layer multiple "representation subspaces". The multi-headed attention has multiple sets of Query/Key/Value weight matrices (e.g., a transformer uses eight attention heads, consisting of eight sets of weight matrices for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.
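A minimal sketch of the multi-headed scaled dot-product attention just described, with eight heads; the Query/Key/Value matrices are random stand-ins for trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, n_heads):
    T, D = X.shape
    d_h = D // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # split each projection into heads: (n_heads, T, d_h)
    split = lambda M: M.reshape(T, n_heads, d_h).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h)   # (n_heads, T, T)
    out = softmax(scores) @ Vh                           # attention-weighted values
    return out.transpose(1, 0, 2).reshape(T, D)          # concatenate heads

rng = np.random.default_rng(1)
X = rng.standard_normal((196, 96))          # 196 patch tokens, 96-dim embeddings
Wq, Wk, Wv = (rng.standard_normal((96, 96)) * 0.05 for _ in range(3))
Y = multi_head_attention(X, Wq, Wk, Wv, n_heads=8)
```

Each head attends in its own representation subspace, and the concatenated head outputs are what a full transformer would then pass through an output projection.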
Two designs of the ViViT architecture are illustrated in Table 1, where the input is the frames of a video clip (shape of MMEW input: (14, 224, 224, 3)), while the output includes seven probabilities corresponding to seven micro-expression classes. The ViViT_FM2 model has an additional CNN block, X_CNN, which is comprised of the first five convolutional layers plus one residual block from the Xception model.

Table 1. Transformer model architectures: ViViT_FM1 and ViViT_FM2. Normalization and Dropout layers are omitted. The batch size (typically shown as None) is omitted in the "Output Shape" column. Attn_FF means attention-based feed-forward network. The numbers shown in "Output Shape" assume inputs from the MMEW dataset.

Recurrent Neural Networks: ConvLSTM Models
Recurrent neural networks (RNNs) are designed to leverage sequential information in data. Unlike traditional neural networks that assume inputs and outputs are independent of each other, RNNs recognize the importance of dependencies in tasks such as natural language processing (NLP); for instance, predicting the next word in a sentence benefits from knowledge about the preceding words. RNNs are termed "recurrent" because they execute the same operation for each element of a sequence, with the output relying on previous computations. Another way to conceptualize RNNs is as having a "memory" that retains information about prior calculations. In theory, RNNs can utilize information from arbitrarily long sequences, but in practice, they are typically limited to considering only a few preceding steps. Figure 2 provides an illustration of a typical RNN architecture.

RNNs have exhibited remarkable achievements in various NLP tasks and applications involving temporal signals [29]. Among the different types of RNNs, long short-term memory (LSTM) networks are widely used and excel in capturing long-term dependencies. LSTMs are essentially similar to RNNs, but they employ a distinct method to compute the hidden state. In LSTMs, memories are referred to as cells, functioning like black boxes that take the previous (hidden) state, s_{t−1}, and the current input, x_t, as inputs. These cells internally make decisions about what information to retain or discard from memory. Subsequently, they combine the previous state, the current memory, and the input. Remarkably, these LSTM units have proven highly effective at capturing long-term dependencies.
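One step of a standard LSTM cell can be written out explicitly: gates decide what to discard from and add to the cell memory c_t, combining the previous hidden state s_{t−1} with the current input x_t. The weights below are random stand-ins, a sketch of the mechanism rather than a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, c_prev, W, U, b):
    """W, U, b each pack the 4 gate parameter sets (forget, input, cell, output)."""
    z = x_t @ W + s_prev @ U + b                 # all gate pre-activations at once
    f, i, g, o = np.split(z, 4, axis=-1)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # update cell memory
    s_t = sigmoid(o) * np.tanh(c_t)                      # new hidden state
    return s_t, c_t

rng = np.random.default_rng(2)
d_in, d_hid = 8, 16
W = rng.standard_normal((d_in, 4 * d_hid)) * 0.1
U = rng.standard_normal((d_hid, 4 * d_hid)) * 0.1
b = np.zeros(4 * d_hid)
s = c = np.zeros(d_hid)
for t in range(5):                               # unroll over a short sequence
    s, c = lstm_step(rng.standard_normal(d_in), s, c, W, U, b)
```

The forget gate f controls what survives in memory, which is what lets LSTMs carry information over many more steps than a plain RNN.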
In some applications (e.g., predicting weather changes), we want to model temporal evolution (e.g., temperature changing over time), ideally using recurrence relations (e.g., LSTM). In facial micro-expression recognition, we need to capture the facial muscle movement over time. At the same time, we also expect to efficiently extract spatial features (e.g., facial muscle movement varying with locations), something that is normally done with convolutional filters. Ideally, then, an architecture should include both recurrent and convolutional mechanisms, which is exactly what convolutional LSTM (ConvLSTM) layers provide.
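A ConvLSTM step replaces the LSTM's matrix multiplications with convolutions, so the gates become spatial maps. This single-channel sketch with 3 × 3 kernels and random weights (our simplification; real layers use many channels) shows the idea.

```python
import numpy as np

def conv2d_same(x, k):
    """'Same'-padded 2D convolution of a single-channel map with a 3x3 kernel."""
    H, W = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[i:i + H, j:j + W]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x_t, h_prev, c_prev, Wx, Wh):
    """Wx, Wh: 3x3 kernels for the 4 gates, input-to-state and state-to-state."""
    f, i, g, o = (conv2d_same(x_t, Wx[n]) + conv2d_same(h_prev, Wh[n])
                  for n in range(4))
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # spatial cell memory
    h_t = sigmoid(o) * np.tanh(c_t)                       # spatial hidden state
    return h_t, c_t

rng = np.random.default_rng(3)
Wx = rng.standard_normal((4, 3, 3)) * 0.1
Wh = rng.standard_normal((4, 3, 3)) * 0.1
h = c = np.zeros((32, 32))
for t in range(4):                        # toy 4-frame clip of 32x32 frames
    h, c = convlstm_step(rng.standard_normal((32, 32)), h, c, Wx, Wh)
```

Because the state h_t is itself an image, the layer tracks where on the face motion is happening as well as how it evolves over frames.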
As shown in Table 2, two designs of ConvLSTM models are illustrated for facial micro-expression recognition. The outputs of the 14 × 7 dimension correspond to 14 frames and seven micro-expression classes (MMEW), and the final recognition result of each video sample is the average over the results of the 14 frames. The major difference between the two models is that the X_ConvLSTM_FM4 model is comprised of bidirectional ConvLSTM layers. In the context of an NLP model, a unidirectional block can only draw on the preceding context, while a bidirectional block can find hints (such as the referent of "it") in both the following and the preceding sentences. It is not surprising that the number of model parameters in the bidirectional X_ConvLSTM_FM4 model is more than doubled compared with that of the unidirectional X_ConvLSTM_FM3 model.
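The parameter doubling can be checked with a back-of-the-envelope count for a ConvLSTM layer (ignoring peephole terms; the channel counts below are illustrative, not the paper's exact layer sizes): four gates, each a k × k convolution over the concatenated input and hidden channels, plus a bias per output channel.

```python
def convlstm_params(c_in, c_out, k=3):
    """4 gates x (k*k*(c_in + c_out) weights + 1 bias) per output channel."""
    return 4 * (k * k * (c_in + c_out) * c_out + c_out)

uni = convlstm_params(c_in=64, c_out=64, k=3)
bi = 2 * convlstm_params(c_in=64, c_out=64, k=3)   # forward + backward copies
```

The bidirectional wrapper runs two such layers (forward and backward), exactly doubling the recurrent parameters; downstream layers that consume the concatenated (doubled-channel) output grow as well, which is why the total model is more than doubled.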

Hybrid Models
As described in previous subsections, we know that CNNs can extract spatial features; furthermore, ConvLSTM layers can capture spatial-temporal changes. CNNs are good at modeling neighborhood changes, whereas transformers with attention can grasp sparse spatial relations (e.g., among different image blocks or across frames in a video clip). As shown in Table 3, two designs of hybrid models combine three NN models: CNN, ConvLSTM, and ViViT. The two models differ in the ConvLSTM layer, where the Hybrid_FM5 model has 128 filters and is followed by a pooling layer. The goal is to combine the methods so as to surpass the performance of any single method.
The Hybrid_FM6 model is illustrated in Figure 3, which consists of three blocks: CNN, ConvLSTM, and Transformer. First, the CNN block (of six convolutional layers) provides local spatial features, where the feature image is the last-layer output randomly selected from 1 of 64 filters and 1 of 14 frames. Second, the ConvLSTM block (of one layer and 64 filters) generates temporal features, where the feature image is randomly selected from 1 of 64 filters and 1 of 14 frames. Third, the Transformer block (of 343 patches and 96 embedding dimensions) presents sparse spatial relations. Fourth, there is a fully connected layer (of 512 filters) prior to the output layer (of seven filters). The Hybrid_FM6 model can predict a micro-expression (e.g., "Anger") from a given facial video (e.g., of 14 frames).

Experimental Results

Both datasets, MMEW and CASME II, were used in our experiments. The number of frames varies with different video files. Based on manual analyses of the minimal and maximal clip lengths and filming speed (FPS), 14 frames are clipped in the MMEW dataset, whereas 24 frames are clipped in the CASME II dataset. If the number of frames in a video file is m times the number of clipped frames (n), then m (typically m ≤ 2) samples (clips) are clipped from that file. However, there are no overlapped (repeatedly used) frames in the m samples from the same video file. The distributions of facial video samples (clips) from the two datasets are shown in Figure 4, where the numbers of clips are larger than the numbers of video files. During the data splits (k-fold cross-validation) for training and testing, we made sure that (i) multiple clips from one video file fall into the same subset, i.e., either in training or in testing; and (ii) the training set includes samples from all classes (stratified split). All facial images were resized to 224 × 224 pixels, then standardized and normalized (intensity stretched). Ten-fold cross-validation was used in our experiments, and the final classification results were calculated by merging all 10-fold testing scores (instead of averaging 10 testing accuracies).
The first subsection briefly reviews the performance metrics used in our experiments. Then, the classification performances of eight NN models are presented. Score fusion improvement is described and quantitatively measured in the next two subsections. The time costs are reported in the last subsection.

Classification Performance Metrics
The performance of micro-expression recognition is measured by F1 score and accuracy, as shown in Tables 4 and 5, where the two metrics are defined as follows.

Table 4. F1 scores of seven micro-expressions, weighted F1 scores, and Accuracy values varying over eight different NN models tested on the MMEW dataset (14 frames at 90 FPS in each sample).
The tests were conducted using 10-fold cross-validation. The prediction (probability) values from the 10 validation folds were merged into one set to calculate the overall F1 (weighted average) and Accuracy scores. The highest F1 score or accuracy in each row is bolded. Notice that the highest accuracy of 0.6940 was reported on this dataset in 2018 [17].


The Precision is the ratio TP/(TP + FP), where TP is the number of true positives and FP is the number of false positives. False positives are the samples that are predicted as positives but labeled as negatives. The Precision is intuitively the ability of the classifier not to predict a negative sample as positive.
The Recall is the ratio TP/(TP + FN) where TP is the number of true positives, and FN is the number of false negatives. False negatives are the samples that are predicted as negatives but labeled as positives. The Recall is intuitively the ability of the classifier to correctly predict all the positive samples.
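These per-class metrics, along with the support-weighted F1 average and Accuracy used in Tables 4 and 5, can be computed from a confusion matrix; the small 3-class matrix below is an illustrative stand-in for the paper's 7- or 5-class results.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                  # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp                  # missed members of the class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 1, 9]])
precision, recall, f1 = per_class_metrics(cm)
support = cm.sum(axis=1)                      # samples per true class
weighted_f1 = (f1 * support).sum() / support.sum()
accuracy = np.diag(cm).sum() / cm.sum()       # correct predictions / total
```

Note that Precision, Recall, and F1 never touch the true negatives (TN), which is why F1 is usable without knowing the total number of observations, whereas Accuracy needs it.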
The F1 score can be interpreted as the harmonic mean of the Precision and Recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative contributions of Precision and Recall to the F1 score are equal. The formula for the F1 score is

F1 = 2 × (Precision × Recall) / (Precision + Recall).

In the multi-class and multi-label cases, the average of the F1 score of each class is analyzed with a weighting parameter. The weights for averaging can be calculated as the number of supported samples in each class divided by the total number of samples. The F1 score is an alternative to the Accuracy metric, as it does not require one to know the total number of observations (e.g., TN). On the other hand, Accuracy tells how often we can expect a machine learning model to correctly predict an outcome out of the total number of predictions.

Table 4 shows the F1 scores of seven micro-expressions, weighted F1 scores, and Accuracy values varying over eight different NN models when tested on the MMEW dataset. Based on the accuracy values, the Hybrid_FM6 model achieves 0.8853, which is the best on the MMEW dataset. Compared to the literature-reported accuracy of 0.6940 on this dataset (by a CNN model in 2018, the best performance on the same dataset that we could find in the literature) [17], Hybrid_FM6's accuracy is very high. It seems that ConvLSTM models are slightly better than ViViT models, both of which are better than CNN models (ResNet-50, Xception). According to the F1 scores, "Anger" is the easiest to recognize, while "Fear" and "Sadness" are the most difficult to detect.

Table 5 shows the F1 scores of five micro-expressions, weighted F1 scores, and Accuracy values varying over eight different NN models when tested on the CASME II dataset. Based on the accuracy values, the Hybrid_FM5 model is the best on the CASME II dataset, and its accuracy is 0.6565.
Compared to the literature-reported accuracy of 0.6341 on this dataset (by an SVM classifier in 2014, the best performance on the same dataset that we could find in the literature) [29], Hybrid_FM5's accuracy is notably higher. Again, ConvLSTM models are slightly better than ViViT models, both of which are better than CNN models (ResNet-50, Xception). According to the F1 scores, "Happiness" is the most difficult to detect, while the other four expressions are about equally hard to recognize.

NN Model Performance on Two Datasets
Overall, the hybrid models (combining ResNet, ConvLSTM, and ViViT) outperform the non-hybrid models, such as the CNN, ViViT, and ConvLSTM models.

Performance Improvement Using Score Fusion
The performance of facial micro-expression recognition can be improved using score fusion methods, where the multiple scores come from the eight different NN models presented in Tables 4 and 5. There are several types of score-fusion methods: arithmetic fusion (e.g., average, majority vote) [30], classifier-based fusion, and density-based fusion (e.g., Gaussian mixture model) [31,32]. Based on score fusion performance [33,34], two classification-based score fusion methods are selected and presented in this study: Support-Vector Machine (SVM) and Random Forest (RF). The multiple scores are combined into feature vectors and then fed into a classifier for training (with labeled score vectors) or testing.
The Support-Vector Machine (SVM) is a supervised learning model utilized for nonlinear classification and data analysis [35]. In the context of training data with categorized observations, the SVM training algorithm constructs a model that can assign new data points to specific categories. For classification purposes, an SVM establishes a hyperplane (or a set of hyperplanes) as a separating line between data points belonging to different classes. The objective is to find the optimal hyperplane that maximizes the distance between the hyperplane and the closest data points in each class. This approach effectively minimizes the generalization error of the classifier by maximizing the margin between the hyperplane and the nearest data points in each class [36].
Random forest (RF) is a classification model employed in supervised learning tasks. It leverages ensemble learning, which combines multiple models to tackle complex problems rather than relying on a single model. The RF algorithm enhances accuracy by utilizing bagging or bootstrap aggregating. It generates individual decision trees by using random subsets of the training dataset as subsamples. Each decision tree produces its own output or classification. The final output is determined through majority voting, where the RF output corresponds to the class chosen by the majority of trees. This approach effectively mitigates the impact of overfitting that may occur in individual decision trees.
In this study, the SVM method employed a Gaussian kernel function and a one-versus-one coding design. This configuration resulted in seven (or five) binary learners for the corresponding seven (or five) classes. In the RF model, we trained an ensemble of 100 classification trees on the complete training dataset. At each decision split, a random subset of predictors (scores) was considered, and the split predictor was selected to maximize the gain of the split criterion over all possible splits of the predictors. The final classifications were obtained by combining the results from all the trees in the ensemble.
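As an illustration (not the authors' implementation), the two fusion classifiers could be configured with scikit-learn roughly as follows; here the RBF kernel plays the role of the Gaussian kernel, and `decision_function_shape="ovo"` corresponds to the one-versus-one coding design:

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Gaussian (RBF) kernel SVM with a one-versus-one coding design
svm_fusion = SVC(kernel="rbf", decision_function_shape="ovo")

# Ensemble of 100 classification trees; a random subset of the
# predictors (scores) is considered at each decision split
rf_fusion = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```

Both objects are then fit on the labeled score vectors and used to predict the fused class; the variable names and hyperparameter defaults beyond those named in the text are illustrative assumptions.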
Scores used for fusion are created in two ways: (i) a feature vector consisting of eight class indices (0, 1, . . . , n − 1; n = 7 or 5) for the predicted classes (pred-class) from the eight NN models; (ii) a feature vector consisting of n accumulated probability values (sum-prob), accumulated onto the predicted classes (n = 7 or 5), from the eight NN models. Score feature-vector (pred-class and sum-prob) examples from the datasets are shown in Tables 6 and 7. Table 6 lists six pred-class feature vectors (for six facial video samples, in six rows), each of which consists of the predicted class index (0–6 for MMEW, 0–4 for CASME II) across the eight NN models, where each probability value (between 0 and 1) is also given (to calculate the sum-prob features). Table 7 lists six sum-prob feature vectors (in six columns), each of which consists of the accumulated probability values (sum-prob) over the n micro-expression classes (n = 7 or 5) for each facial video sample. For CASME II Sample 2, two models classified it as "Surprise" (1) with sum-prob = 1.4677, four models classified it as "Disgust" (2) with sum-prob = 2.2599, and the other two models classified it as "Others" (3) with sum-prob = 1.4687. The majority-voted result is "Disgust".
In sum, each pred-class feature vector concatenates the classification results (class indices) from the different models, while each sum-prob feature vector holds the classification probabilities from the different models, summed and distributed across the micro-expression classes.
The score-fusion experiments are conducted using 10-fold cross-validation, and the final results (shown in Tables 6 and 7) are computed from the merged prediction outcomes of the 10 folds. On the MMEW dataset, the RF method with pred-class features performs best overall: the accuracy of facial micro-expression recognition improves from 0.8853 to 0.9684, a substantial improvement. On the CASME II dataset, the best method is again the RF method with pred-class features, which reaches 0.9112, in contrast to the best single-NN-model accuracy of 0.6565.
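This 10-fold protocol can be sketched with scikit-learn's `cross_val_predict`, which merges the held-out predictions from all ten folds before the accuracy is computed; the placeholder data below stand in for the real pred-class score vectors from the eight NN models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Placeholder scores: 8 pred-class indices per sample (real vectors
# would come from the eight NN models); labels cycle over 7 classes.
X = rng.integers(0, 7, size=(200, 8))
y = np.arange(200) % 7

rf = RandomForestClassifier(n_estimators=100, random_state=0)
# Merged held-out predictions across the 10 folds
y_pred = cross_val_predict(rf, X, y, cv=10)
fused_accuracy = accuracy_score(y, y_pred)
```

With the paper's real score vectors in place of the synthetic `X`, `fused_accuracy` would correspond to the merged 10-fold accuracy reported in the tables.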
The RF method with sum-prob features appears to be better at recognizing certain facial micro-expressions, such as Happiness, Surprise, and Disgust.

Metric for Fusion Improvement: Relative Rate Increase (RRI)
The performance improvement achieved by score fusion (SF) cannot be properly measured by the absolute difference of two accuracy rates. For example, improving the rate from 80% to 90% is harder than improving it from 98% to 99%; generally speaking, improving the rate via SF becomes increasingly difficult as the original rate approaches 100%. We therefore adopt the Relative Rate Increase (RRI) [29] to evaluate the fusion improvement: RRI = (R_F − R_V)/(1 − R_V), where R_F is the accuracy rate achieved via SF and R_V is the mean of the accuracy rates of all classification models. If R_V = 1 (no improvement via SF is needed), then RRI is set to 1. The absolute rate increase, ARI = R_F − R_V, does not precisely measure the performance improvement, as noted above. RRI ∈ (0, 1], where a higher value is better. Under this definition, the two fusion improvements (from 80% to 90% and from 98% to 99%) are equivalent, both with RRI = 0.50, in the sense of their difficulty and the effort required to achieve them. Whereas many metrics (e.g., F1, Precision, Recall) measure absolute performance, the RRI metric measures the actual improvement against the total possible improvement.
The RRI values from the best SF results are listed in the right-most columns of Tables 8 and 9. In Table 8, RRI[F1(Anger)] = 1.0 means the SF improvement is perfect (cannot be better). RRI[F1(Fear)] = 0.8977 (the second best) means that the SF rate of 0.9708 is an 89.77% improvement relative to the mean rate of 0.7146. In Table 9, the best value, RRI[F1(Repression)] = 0.8914, indicates that the SF rate of 0.9546 is an 89.14% improvement over the mean rate of 0.5821. The second best, RRI[F1(Surprise)] = 0.8527, means an 85.27% SF increase over the averaged performance of the individual models.

The numbers of model parameters and the time costs of the models are presented in Table 10 (on MMEW) and Table 11 (on CASME II). Time costs depend on the NN model (number of parameters) and the data size (number of frames and samples). Model training is typically completed offline. In a real application, model inference (prediction) only processes one set of given frames or images. For example, the Hybrid_FM6 model takes approximately 20 milliseconds per sample (14 frames) to predict a micro-expression, which is fast enough for real-time applications. The time costs on the CASME II dataset are longer because more frame data are processed (24 frames per video sample).

Summary and Discussion
In this study, we compared eight different neural network models for recognizing facial micro-expressions on two datasets, where six of the eight models were newly designed for micro-expression recognition. Ranked by accuracy (from high to low), the model families are: hybrid, ConvLSTM, vision transformer, and CNN. Overall, we recommend the hybrid models, which achieve the highest accuracy while remaining fast enough for real applications. The hybrid models are created by combining the fundamental building blocks of the CNN, ConvLSTM, and vision transformer models, which extract spatial features (within an image neighborhood, by the CNN), summarize temporal features (across video frames, by the LSTM), and capture sparse spatial relations (among image blocks and video frames, by the transformer).
Score fusion can significantly increase the facial micro-expression recognition rate. For example, "Fear" was recognized at a rate of only 0.7146 (on the MMEW dataset); random forest fusion improved the rate to 0.9708, an 89.77% improvement according to the Relative Rate Increase (RRI) metric. The best overall accuracies of the hybrid models are 0.8853 (on the MMEW dataset) and 0.6565 (on the CASME II dataset), while score fusion boosts them to 0.9684 and 0.9112, respectively. In addition, score fusion uses only the outputs (e.g., predicted classes or probabilities) of multiple classifiers and incurs no additional hardware cost.
Information fusion can further increase recognition accuracy by combining classification scores from different NN models and from different imaging modalities (e.g., an infrared camera). With a larger-scale dataset, recognition reliability will also improve. Inference latency may be further reduced with more capable hardware (a high-end GPU or multiple GPUs).
Our experimental results shed light on a new method for real-time micro-expression recognition. Moreover, score fusion can further improve recognition performance without extra hardware costs. Real-time micro-expression recognition can be implemented on and integrated into mobile devices or humanoid robots, enabling a friendly human–machine interface that takes micro-expressions into account for better decision-making.