Cross-Domain Few-Shot Micro-Expression Recognition Incorporating Action Units

A micro-expression, unlike an ordinary facial expression, is an involuntary, spontaneous, and subtle facial movement that reveals true emotions which people intend to conceal. As it usually occurs within a fraction of a second (less than 1/2 second) with a low action intensity, capturing micro-expressions among facial movements in a video is difficult. Moreover, when a micro-expression recognition system works in cold-start conditions, it has to recognize novel classes of micro-expressions in a new scenario while lacking sufficient labeled samples. Inconsistency in micro-expression labeling criteria also makes it difficult to use existing labeled datasets in other scenarios. To tackle these challenges, we present a micro-expression recognizer which, on the one hand, leverages the knowledge of facial action units (AUs) to enhance facial representations and, on the other hand, performs cross-domain few-shot learning to transfer knowledge acquired from other domains with different data labeling protocols and feature distributions, overcoming the scarcity of labeled samples in the cold-start scenario. In particular, we draw inspiration from the correlation between micro-expressions and facial action units, and design an action unit module that extracts subtle AU-related features from videos. We then fuse the AU-related features with general features extracted from optical-flow facial images. Through fine-tuning, we transfer knowledge from datasets in different domains to the target domain.
The experimental results on two datasets show that: (1) the proposed recognizer can effectively learn to recognize new categories of micro-expressions in different domains with very few labeled samples, achieving a UF1 score of 0.544 on the CASME dataset and outperforming the state-of-the-art methods by 0.089; (2) the performance of the recognizer is more competitive when it distinguishes micro-expression videos of more categories; and (3) the action unit module improves the recognition performance by 0.072 and 0.047 on CASME and SMIC, respectively.


I. INTRODUCTION
Among the different parts of the body, the face is an ideal site for transmitting information, owing to its diverse, obvious, and quick muscle movements. Facial expressions refer to the facial movements that convey human emotions and intentions [1]. Unlike ordinary facial expressions, facial micro-expressions occur within a fraction of a second with a low action intensity. This involuntary emotional leakage usually exposes true emotions and feelings that people tend to hide. In some cases, even though people can deliberately pose false and misleading facial expressions, they can hardly hide their micro-expressions, which reveal their real emotional states [2]. Haggard and Isaacs [3] once played video recordings of a conversation between a patient and a psychotherapist at a slow rate, spotting a transient micro-expression of grimace between the patient's smiles.
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
Because natural and involuntary micro-expressions reveal true emotions, micro-expression recognition technologies have a wide scope of applications in fields such as psychological and clinical diagnosis, criminal investigation, and judicial judgment.
In the literature, substantial efforts have been made on micro-expression recognition. The majority of the work focuses on the process of feature extraction. Given a facial image or a video, the micro-expression recognition system needs to generate feature representations with a specific feature extraction method. Feature representations help the recognition system summarize features of raw data, discarding information irrelevant to the recognition task. Based on the generated representations, the data samples can then be classified into several micro-expression categories by the system. Traditional micro-expression recognition methods (e.g., Local Binary Pattern histograms from Three Orthogonal Planes (LBP-TOP) [4] and Bi-Weighted Oriented Optical Flow (Bi-WOOF) [5]) extract hand-crafted features from raw data, mapping micro-expression videos to a feature space. These features are then classified by a classifier such as a support vector machine (SVM). With the recent development of various deep neural networks (e.g., Visual Geometry Group (VGG) [6], AlexNet [7] and ResNet [8]), deep neural networks have been adopted in micro-expression recognition [9]-[16]. As deep neural networks (DNNs) can acquire deep features from input samples, DNN-based micro-expression recognition methods significantly outperform the traditional methods. However, DNN-based techniques need large-scale annotated datasets to train feature extractors and classifiers. Otherwise, the recognition models tend to overfit the samples provided, resulting in poor classification performance.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Despite the progress in micro-expression recognition mentioned above, few applications have been implemented so far. Two main challenges make it difficult to widely use micro-expression recognition technologies in real-life scenarios. First, as the duration of a micro-expression is short and its occurrence is relatively rare, only a very limited number of micro-expression samples can be captured from a large amount of facial video. Hence, the features learnt by micro-expression recognition systems are limited, and it is challenging to provide a method with high recognition accuracy. Second, in real-life scenarios, micro-expression recognition systems face a cold-start problem: they have to recognize micro-expression videos of unseen classes with only a few labeled samples to learn from. As large-scale micro-expression datasets only contain samples of basic emotion categories for the sake of universality, recognition systems trained with such datasets cannot recognize task-specific emotion categories. For example, panic, anger and anxiety may appear in prisons, yet are unseen when training a recognition system. Meanwhile, as it is labor-intensive and time-consuming to build a dataset of these new micro-expression categories from scratch, training a recognition system with task-specific datasets is not feasible, and only a few labeled samples are available. In other tasks under cold-start conditions, transfer learning [17] methods are introduced, using datasets from other scenarios to augment training datasets. However, it is also challenging to use the available datasets. Since the protocols of micro-expression data collection are not unified [18], [19], the datasets introduced could be quite different from the micro-expression samples the model needs to recognize (e.g., in the stimuli eliciting micro-expressions). Most importantly, the categories of micro-expressions in the datasets introduced could be quite different from those in the application scenario. For example, an auxiliary diagnostic system for psychotherapists needs to recognize repression, despair, and anxiety, while extreme emotions like anger could be prioritized for prison management systems.
To tackle the first challenge, we leverage knowledge of facial action units (AUs) to strengthen facial representations for micro-expression recognition. This is inspired by the significant correlation between micro-expressions and facial action units, as well as by established research on facial action units. AUs are a set of objective labels describing facial muscle movements, and they are related to different facial regions. Micro-expressions of a specific emotion category correspond to certain groups of AUs. For example, the facial expression of happiness includes cheek raiser (AU6) or lip corner puller (AU12) [20], while sadness includes inner brow raiser (AU1) [21]. Therefore, we can enrich the features learnt from raw data by incorporating AU-related features. Furthermore, as AUs are region-specific, AU-related features can guide the model to place more emphasis on local regions that strongly influence emotion expression. Previous studies [20]-[25] have shown that a certain group of AUs cannot solely determine the micro-expression category, i.e., two different micro-expressions may share the same group of AUs. Thus, the feature extractor in our proposed model not only extracts AU-related features from raw data, but also considers general features extracted from optical-flow facial images.
For the cold-start challenge, we perform cross-domain few-shot learning, which has attracted researchers' attention in recent years. There are areas related to cross-domain few-shot learning that also learn unseen classes from streaming data. Continual learning [26]-[29] requires a model to learn new tasks sequentially and avoid catastrophically forgetting former knowledge. Active learning enables models to interactively select samples to be labeled by specialists or other sources [30]-[32]. Different from these studies, cross-domain few-shot learning focuses on the data scarcity problem of the new task. Two methods (fine-tuning and a metric-based few-shot learning method) are adopted to enable the model to acquire knowledge from datasets available in other scenarios (the source domain), and then transfer the knowledge to the scenario where it works (the target domain), recognizing novel classes with only a few labeled samples.

A. KEY CONTRIBUTIONS
The main contributions of the study are summarized as follows:
• We propose the Micro-expression Recognizer incorporating Action Units (MERAU), which incorporates knowledge of facial action units, and effectively learns to recognize new categories of micro-expressions in a different domain with a few labeled samples. We are the first to combine AU-related features with general features extracted from optical flow in micro-expression recognition.
• To handle cold-start problems in potential applications of micro-expression recognition, we perform cross-domain few-shot learning for micro-expression recognition.
• We propose that incorporating AU-related features in feature extraction can help the model better differentiate samples of different categories at the representation level.
The remainder of this paper is structured as follows. In Section II, we summarize studies related to our work. In Section III, our proposed micro-expression recognition framework is introduced, along with two different learning methods applied to this framework. Section IV describes the details of our experiments and experimental results. We conclude our work and point out potential research directions in Section V.

II. RELATED WORK
In this section, we review relevant literature on micro-expression recognition. Studies on facial action unit detection, cross-domain few-shot learning, and multimodal fusion techniques are also summarized.

A. MICRO-EXPRESSION RECOGNITION
Micro-expression recognition methods can be categorized into statistical methods and deep learning methods by forms of feature extraction. Statistical methods adopt handcrafted feature extraction to describe the characteristics of micro-expression videos, aiming to transform original data into statistic features. A representative statistical method, LBP-TOP [33], is used as baseline in Facial Micro-Expressions Grand Challenge (MEGC) 2018 [34] and MEGC 2019 [18].
In comparison, deep learning methods utilize deep neural networks to extract features from micro-expression videos. As deep convolutional networks are powerful for extracting discriminative features from original data, deep learning methods outperform statistical methods in most micro-expression recognition scenarios. So far, a lot of deep learning methods have been developed for recognizing micro-expressions with deep neural networks [9]- [16].
Some studies such as [9] considered all frames in a video when extracting features, which increases the computational complexity. Liong et al. [5], however, found that not all frames are necessary for providing adequate information, and promoted the use of only the onset and apex frames of a video instead.
Previous studies also considered recognizing micro-expressions with the aid of facial action units. After extracting features with a 3D ConvNet, Xie et al. [35] transformed those features into a feature map for building an AU graph. A Graph Convolutional Network (GCN) was then used to process AU node features and provide information for micro-expression recognition. Unlike this work, which relied only on AU-related features for micro-expression recognition, we integrate AU-related features with general features extracted from optical-flow images, based on the previous studies [20]-[25], which showed that a certain group of AUs cannot distinguish micro-expression categories well.
Recent work has considered the practicality of micro-expression recognition systems in the real world. Li et al. [36] handled the small size of micro-expression training datasets by using neighbouring frames of the apex frame. Lai et al. [37] and Hashmi et al. [38] focused on real-time micro-expression recognition, proposing end-to-end micro-expression recognition systems.

B. FACIAL ACTION UNIT DETECTION
Studies on facial action units detection can also be divided into two main categories: AUs occurrence detection [39]- [43] and AUs intensity estimation [44]- [48]. AUs occurrence detection intends to recognize the occurrence of each AU, transforming AUs detection into a multi-labeled binary classification problem. In comparison, AUs intensity estimation considers not only the presence but also exact intensity levels of AUs, i.e., from 1 to 5.
Early research on facial action unit detection used features of the whole face with hand-crafted feature extraction methods [49]. Since each AU is related to a certain facial region, sparsity-induced methods were then introduced into AU detection, reducing interference from irrelevant regions. For instance, Zhao et al. [41] proposed a region layer. Instead of sharing weights across the entire image, the region layer has local convolution components for different facial regions, thus enabling the model to capture local appearance changes. Li et al. [42] attached E-Net and C-Net to a conventional deep convolutional network. E-Net places more emphasis on active regions related to AUs with an attention mechanism, and C-Net crops AU areas of interest.
Due to the difficulty of AU labeling, some studies focus on weakly supervised or self-supervised AU detection to reduce the dependence on manually labeled samples. Weakly supervised studies do not need correct and exact labels from human annotation. Zeng et al. [40] proposed a weakly supervised learning method based on confidence. Zhang et al. [47] used the prior knowledge that AU intensity increases monotonically between the onset frame and the apex frame during a facial action. Self-supervised learning generates supervisory information from unlabeled data, using its own structure. Twin-Cycle Autoencoder [39] disentangled AU-related movements from head-motion-related ones in videos. This model was trained with facial image pairs of the same person in videos. In the absence of manual annotation, the model learned to recognize displacements of pixels between the source image and the target image. Thus, the model can be optimized with the reconstruction loss.

C. CROSS-DOMAIN FEW-SHOT LEARNING
Few-shot learning is an important subproblem of machine learning. It aims to improve the performance of models on a specific task with knowledge acquired from a few labeled samples [50]. In many real-life scenarios, due to the lack of labeled samples for training, models are likely to overfit and perform poorly on testing sets. Therefore, researchers have proposed a series of methods to tackle this problem. For example, ProtoNet [51] computes the mean of the samples in each category as the prototype of that category in the feature space. A test sample can then be classified by computing its distance to each prototype, where a smaller distance indicates a higher likelihood of belonging to that category. Siamese Network [52] embeds a sample pair into the feature space with an identical neural network, and applies a binary classifier to indicate whether the pair of samples belong to the same category. Unlabeled test samples can thus be classified by comparing them with the labeled samples in each category.
Cross-domain learning, also known as domain adaptation, requires models to solve problems in a target domain while only utilizing knowledge learnt from a source domain, even though samples in these two domains have different feature distributions. Since few-shot learning methods tend to use knowledge from other domains as supplementary knowledge, few-shot learning and cross-domain learning tasks are highly correlated and should be considered together [53]. A number of recent studies [54]-[57] try to address cross-domain few-shot problems, incorporating knowledge learnt from source domains. Chen et al. [57] addressed the cross-domain few-shot problem in generic object recognition and fine-grained image classification. Two different fine-tuning methods are implemented, as well as several metric-based few-shot learning methods. Their experimental results show surprisingly competitive performance of fine-tuning methods. Inspired by their work, we introduce fine-tuning and metric-based few-shot learning methods into micro-expression recognition.
D. MULTIMODAL FUSION
Multimodal fusion can be classified based on the fusion time [68]. Late fusion methods fuse multimodal features at the decision level, providing independent models for different modalities that do not interfere with each other [69]. Early fusion methods fuse at the feature level. Li et al. [66] concatenated the three channels of an RGB image with the two channels of the optical flow image before feature extraction. In this study, we take the early fusion strategy to integrate AU features with the ones extracted from optical flow images as the final embedding of the raw video in the feature space.

III. PROBLEM DEFINITION AND METHODOLOGY

A. PROBLEM DEFINITION
Given a user's frame sequence containing an onset frame and an apex frame, denoted as $x = (s_{onset}, s_{apex})$, our task is to identify his/her micro-expression $y$ in the category set $E_t$ based on $x$. In this study, we consider two different category sets, i.e., $E_t = \{\text{Tense}, \text{Repression}, \text{Disgust}, \text{Surprise}\}$ or $E_t = \{\text{Positive}, \text{Negative}, \text{Surprise}\}$. Let $X_t$ be the set of onset and apex frame pairs.
Assume we only have a limited number of $K$ labeled samples for each target class in $E_t$, while the remaining samples in $X_t$ are left unlabeled. If the model is trained merely on these samples, it can hardly obtain knowledge of micro-expressions, which results in poor performance at test time. We therefore intend to acquire knowledge from labeled samples in datasets available from other scenarios, referred to as the source domain; the samples in the scenario where we cold-start are referred to as the target domain. We thus cast the problem as a cross-domain few-shot learning setting.
Let $E_s$ denote the set of source categories and $E_t$ the set of target categories, where $E_s \neq E_t$ and $|E_s| \neq |E_t|$. We use $D_{train} = \{(x^t_1, y^t_1), (x^t_2, y^t_2), \cdots, (x^t_{n_t}, y^t_{n_t})\}$ to denote the $n_t$ labeled samples in the source domain, where $n_t \gg n_s$, $x^t_i \in X_s$, and $y^t_i \in E_s$ (for $i = 1, 2, \cdots, n_t$). Furthermore, to incorporate knowledge of AUs, we utilize another AU-labeled dataset $D_{au}$. It shares the same set of frame pairs $X_s$ with $D_{train}$, yet has a different annotation format from the micro-expression datasets $D_{train}$, $D_{support}$, and $D_{test}$. Let $D_{au} = \{(x^a_1, y^a_1), (x^a_2, y^a_2), \cdots, (x^a_{n_a}, y^a_{n_a})\}$, where $n_a$ is the number of samples in the AU dataset, and for each $(x^a_i, y^a_i) \in D_{au}$, $y^a_i$ is a 10-dimensional binary vector signifying the existence of ten typical action units (Inner Brow Raiser, Outer Brow Raiser, Brow Lower, Lid Tightener, Nose Wrinkler, Upper Lip Raiser, Lip Corner Puller, Dimpler, Lip Corner Depressor, Chin Raiser) in $x^a_i \in X_s$. Here, the value 1 represents existence, and 0 otherwise.
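The AU annotation format described above can be illustrated with a small sketch. The helper name `encode_au_label` and the use of AU names as dictionary keys are assumptions made for illustration only; the ordering of the ten units simply follows the list in the text.

```python
# The ten action units, in the order listed in the problem definition.
AU_NAMES = [
    "Inner Brow Raiser", "Outer Brow Raiser", "Brow Lower", "Lid Tightener",
    "Nose Wrinkler", "Upper Lip Raiser", "Lip Corner Puller", "Dimpler",
    "Lip Corner Depressor", "Chin Raiser",
]


def encode_au_label(present_aus):
    """Return a 10-dimensional 0/1 vector marking which AUs are present.

    `present_aus` is a hypothetical input format: a set of AU names.
    """
    unknown = set(present_aus) - set(AU_NAMES)
    if unknown:
        raise ValueError(f"unknown action units: {sorted(unknown)}")
    return [1 if name in present_aus else 0 for name in AU_NAMES]


# Example: a frame annotated with two active action units.
y_a = encode_au_label({"Lip Corner Puller", "Dimpler"})
```

Each position of the resulting vector plays the role of one binary target for the AU detector.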

B. OVERALL FRAMEWORK
The presented framework, Micro-Expression Recognizer incorporating Action Units (MERAU), performs cross-domain few-shot micro-expression classification. As shown in Figure 1, MERAU consists of two modality feature extractors (named the AU module and the Optical-flow module) and a classifier. The Optical-flow module acquires optical flow information from the onset and apex frames of a video with an encoder, and maps it to a low-dimensional feature space with a projection layer. The AU module extracts AU-related information from the apex frame of the video, and transforms it into two different feature embeddings. The three feature embeddings generated by the Optical-flow module and the AU module are then concatenated as the final embedding of the raw video in the feature space. MERAU implements two different ways of learning (fine-tuning and metric-based few-shot learning) to project the final embedding of the raw video into the label space, detecting the micro-expression category of the given facial video.

1) AU MODULE
We utilize Twin-Cycle Autoencoder (TCAE) [39] as the encoder of the AU module. For a frame sequence, we only feed its apex frame $s_{apex}$ into this encoder. The output of the TCAE encoder, $x_{au}$, is then fed into an AU detector $P$ pretrained with an AU dataset $D_{au}$. The detector transforms the AU-related features into an AU prediction $p = [\omega_1, \omega_2, \cdots, \omega_A]$, with $\frac{1}{1+e^{-\omega_i}}$ as the probability that $s_{apex}$ contains the $i$-th action unit. Since the AU prediction feature $p$ is obtained with additional supervision (AU pretraining), it may have a different distribution from the AU-related feature $x_{au}$. In order to fuse $x_{au}$ and $p$, we use two projection layers with ReLU activation functions to project them into the same feature space separately. Meanwhile, the projection layer for $x_{au}$ transforms it into a low-dimensional vector, extracting task-related information. The projections in the feature space are denoted by $e_{a1}$ and $e_{a2}$.
To incorporate knowledge of AU detection into our model, the AU detector $P$ needs to be pretrained with the AU-labeled samples in $D_{au}$. For a sample $x$ in $D_{au}$, $P_{\phi}(x) \in [0, 1]^A$ represents the probabilities of occurrence of all AU labels. We compute the AU loss $L_{au}$ between $P_{\phi}(x)$ and the AU labels, and obtain the optimized parameters of $P$ by minimizing it, denoted by $\hat{\phi} = \arg\min_{\phi} L_{au}$. Note that $\phi$ will be frozen in the follow-up micro-expression training and testing.
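As an illustration of AU pretraining, the sketch below computes a loss for one sample from the raw detector scores and the 0/1 AU labels. The exact form of the AU loss is not given here, so the mean binary cross-entropy used below is an assumption (a common choice for multi-label detection), and the function names are hypothetical.

```python
import math


def sigmoid(w):
    """Map a raw detector score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-w))


def au_bce_loss(scores, labels, eps=1e-12):
    """Mean binary cross-entropy over the A action units (assumed loss form)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total = 0.0
    for w, y in zip(scores, labels):
        # Clamp the probability so that log() never receives 0.
        p = min(max(sigmoid(w), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(scores)


# Toy example with A = 3 action units.
scores = [2.0, -1.5, 0.0]
labels = [1, 0, 1]
loss = au_bce_loss(scores, labels)
```

Minimizing such a loss over the AU-labeled samples yields the detector parameters that are subsequently frozen.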

2) OPTICAL-FLOW MODULE
As the category of a micro-expression is not determined by AU-related information alone, we compute optical flow images from the onset and apex frames, which describe the geometric deformations in facial videos, and then feed them into the Optical-flow module. We take ResNet18 [8] as the encoder of the Optical-flow module, and use the Gunnar Farnebäck algorithm [70] to generate the dense optical flow used as its input. We intend to map the high-dimensional feature embedding $x_{of}$ obtained by the Optical-flow module to the same feature space as $e_{a1}$ and $e_{a2}$, and to fuse the three features.
Hence, we use a projection layer with a ReLU activation function to transform $x_{of}$ into a low-dimensional feature embedding $e_{of}$, which serves as the optical-flow feature of the video.

3) CLASSIFIER
We use $M$ to denote the combination of the AU module and the Optical-flow module. For each input sample, the feature embedding given by $M$ is the concatenation of the three feature vectors, which can be denoted as:
$$e = [e_{a1}; e_{a2}; e_{of}] \quad (3)$$
The classifier $C$ then projects $e$ into the label space, predicting the micro-expression category of the given facial video. Here, we adopt two different learning methods (i.e., fine-tuning and metric-based few-shot learning) to perform classification.
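The fusion step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the projection layers are stand-ins with random weights, the dimensions (8, 10, 16 projected to 4) are arbitrary toy values, and `make_projection` is a hypothetical helper rather than part of MERAU.

```python
import random


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    """Apply y = W v + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def make_projection(in_dim, out_dim, rng):
    """Hypothetical projection layer: a random linear map followed by ReLU."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return lambda v: relu(linear(weights, bias, v))


rng = random.Random(0)
proj_au_feat = make_projection(in_dim=8, out_dim=4, rng=rng)   # x_au -> e_a1
proj_au_pred = make_projection(in_dim=10, out_dim=4, rng=rng)  # p    -> e_a2
proj_flow = make_projection(in_dim=16, out_dim=4, rng=rng)     # x_of -> e_of

# Placeholder inputs standing in for the encoder outputs.
x_au = [rng.random() for _ in range(8)]
p = [rng.random() for _ in range(10)]
x_of = [rng.random() for _ in range(16)]

# Early fusion: concatenate the three projected embeddings.
e = proj_au_feat(x_au) + proj_au_pred(p) + proj_flow(x_of)
```

Projecting each input separately before concatenation lets features with different distributions share one embedding space, which is the motivation given for the projection layers.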
We use $D_{train}$ to train the feature embedding model $M$. $M$ transforms samples into low-dimensional feature embeddings $e$; the process can be denoted by $e = M_{\theta,\phi}(x)$, where $\phi$ denotes the frozen parameters of the AU detector $P$, and $\theta$ denotes the trainable parameters of $M$. Based on the label space of the dataset, classifier $C$ transforms $e$ into a category label, represented by $p = C(e)$. We can denote the combination of the feature embedding model $M$ and classifier $C$ by a function $f_{\theta,\phi}(x) = y$, use the loss function $L_{exp}$ to train $M$, and obtain:
$$\hat{\theta} = \arg\min_{\theta} L_{exp}(f_{\theta,\phi}(x), y) \quad (4)$$
Note that the detailed form of the loss function $L_{exp}$ depends on the learning method we use.

a: FINE-TUNING
The fine-tuning method uses a fully-connected layer as the classifier. It has the weight $W \in \mathbb{R}^{d \times |E_s|}$ at the training stage, where $d$ denotes the dimension of the feature embedding $e$, and $E_s$ is the set of micro-expression categories in $D_{train}$. The classifier is trained together with the Optical-flow module and the AU module. When moving to $D_{support}$ and $D_{test}$, only the parameters of the feature embedding model $M$ are kept, and the weight matrix of $C$ is re-initialized to $W \in \mathbb{R}^{d \times |E_t|}$, where $E_t$ is the set of micro-expression categories in $D_{support}$ and $D_{test}$.
The training and fine-tuning processes are shown in Figure 2. For a basic classifier, when we feed the feature embedding $e$ into classifier $C$, the output is:
$$\hat{y} = W^{T} e \quad (5)$$
Additionally, following the setting proposed by Chen et al. [57], we implement the fine-tuning method with a cosine-distance based classifier. The weight matrix of $C$ is $W \in \mathbb{R}^{d \times |E_s|}$, which is the concatenation of $|E_s|$ vectors, $[w_1, w_2, \cdots, w_{|E_s|}]$. When a feature embedding is fed into the classifier, the output is:
$$\hat{y} = [\mathrm{sim}(e, w_1), \cdots, \mathrm{sim}(e, w_i), \cdots, \mathrm{sim}(e, w_{|E_s|})] \quad (6)$$
where $\mathrm{sim}$ is a cosine similarity function. Given two vectors $e$ and $w$, the output is computed as:
$$\mathrm{sim}(e, w) = \frac{e^{T} w}{\|e\| \, \|w\|} \quad (7)$$
Similar to the basic classifier, the cosine-distance based classifier is parameterized by $W \in \mathbb{R}^{d \times |E_t|}$ at the testing stage.
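For these two classifiers, we use the same cross entropy loss function as follows:
$$L_{exp} = -\sum_{i} y_i \log \frac{\exp(\hat{y}_i)}{\sum_{j} \exp(\hat{y}_j)} \quad (8)$$
where $y_i$ is 1 if the sample belongs to category $i$ and 0 otherwise.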
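To make the cosine-distance based classifier concrete, the sketch below scores an embedding against one weight vector per category and evaluates a softmax cross-entropy on those scores. The weight vectors and the embedding are toy values, and applying softmax directly to the unscaled similarities is an assumption of this sketch.

```python
import math


def cosine_sim(e, w):
    """Cosine similarity e^T w / (||e|| ||w||); both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(e, w))
    norm_e = math.sqrt(sum(a * a for a in e))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_e * norm_w)


def cosine_logits(e, class_weights):
    """One similarity score per class weight vector w_i."""
    return [cosine_sim(e, w) for w in class_weights]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [x / s for x in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax."""
    return -math.log(softmax(logits)[target])


# Toy setup: three categories, two-dimensional embeddings.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
e = [2.0, 0.1]
logits = cosine_logits(e, W)
pred = max(range(len(logits)), key=logits.__getitem__)
loss = cross_entropy(logits, target=0)
```

Because the scores depend only on the angle between the embedding and each weight vector, the magnitude of the embedding does not affect the prediction.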

b: METRIC-BASED FEW-SHOT LEARNING
Metric-based few-shot learning uses distance metrics to differentiate between samples in a dataset. We implement ProtoNet [51], a typical and effective metric-based few-shot learning method. It computes the mean of the samples in each category as the prototype, and compares the Euclidean distances between the feature embeddings of query samples and the prototypes. The core of the ProtoNet method in micro-expression recognition is to grasp the representative prototype of each micro-expression category in the feature space. Despite the lack of labeled samples in the target domain, the model learns how to generate micro-expression feature prototypes with samples in the source domain. In the target domain, the model directly generates prototypes without learning.
We assume that there are $|E_t|$ categories of micro-expressions in $D_{test}$, and each category has $K$ labeled samples in $D_{support}$. At the training stage, instead of using all labeled samples in the training dataset $D_{train}$, we pick samples from only $|E_t|$ categories of $D_{train}$, and split them into a Support Set $\{S_1, \cdots, S_{|E_t|}\}$ and a Query Set $\{Q_1, \cdots, Q_{|E_t|}\}$, where $S_i$ and $Q_i$ denote the support samples and query samples of category $i$, respectively.
We group these support and query samples into different episodes. For each episode, we select $K$ labeled samples from the support samples $S_i$ of each category $i$ as episode support samples, and select $T$ unlabeled samples from $Q_i$ as episode query samples. Thus, an episode contains a total of $|E_t| \cdot (K + T)$ samples. As shown in Figure 3, samples are first transformed into embeddings in the feature space by $M$. For each category, the embeddings of all episode support samples are then averaged into the prototype of the category, denoted by $c$.
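The episode construction described above can be sketched as follows. The function `sample_episode`, the generic category names, and the string identifiers standing in for frame pairs are all hypothetical and used only for illustration.

```python
import random


def sample_episode(support_pool, query_pool, k_shot, t_query, rng):
    """Draw K support and T query samples per category for one episode.

    support_pool and query_pool map a category name to a list of samples.
    """
    episode_support, episode_query = {}, {}
    for category in support_pool:
        episode_support[category] = rng.sample(support_pool[category], k_shot)
        episode_query[category] = rng.sample(query_pool[category], t_query)
    return episode_support, episode_query


rng = random.Random(0)
categories = ["c0", "c1", "c2", "c3"]  # placeholder category names
# Placeholder identifiers; real samples would be (onset, apex) frame pairs.
support_pool = {c: [f"{c}_s{i}" for i in range(20)] for c in categories}
query_pool = {c: [f"{c}_q{i}" for i in range(20)] for c in categories}

support, query = sample_episode(support_pool, query_pool,
                                k_shot=5, t_query=3, rng=rng)
episode_size = (sum(len(v) for v in support.values())
                + sum(len(v) for v in query.values()))
```

With four categories, K = 5 and T = 3, the episode holds 4 · (5 + 3) = 32 samples, matching the count given in the text.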
The model classifies each episode query sample $q_j$ by comparing it with all prototypes:
$$\hat{y}_j = \arg\min_{i} \mathrm{dist}(M_{\theta,\phi}(q_j), c_i) \quad (9)$$
where $\mathrm{dist}(\cdot, \cdot)$ is the function that computes the Euclidean distance between embeddings. Note that for two embeddings $x$ and $y$, $\mathrm{dist}(x, y) = \|x - y\|_2$.
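More details about the sample selection and parameter optimization can be found in Algorithm 1. To optimize the parameters of the feature embedding model $M$, we use a cross entropy loss based on distance:
$$L_{exp} = -\log \frac{\exp(-\mathrm{dist}(p, c_j))}{\sum_{k=1}^{|E_t|} \exp(-\mathrm{dist}(p, c_k))} \quad (10)$$
where $p = M_{\theta,\phi}(q)$ is the embedding of a query sample $q$ of category $j$, and $c_j$ is the prototype of category $j$.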
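Algorithm 1 (steps 6 to 18)
 6:         for support sample $s \in V_s$ do
 7:             Append $M_{\theta,\phi}(s)$ to $P$
 8:         end for
 9:         $c_i \leftarrow$ mean of the embeddings in $P$
10:     end for
11:     for $i \leftarrow 1$ to $|E_t|$ do
12:         for query sample $q \in Q_i$ do
13:             $p \leftarrow M_{\theta,\phi}(q)$
14:             Calculate $L_{exp}$ with Equation 10
15:             Update $\theta$ with $\nabla L_{exp}$
16:         end for
17:     end for
18: end for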
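The prototype computation, nearest-prototype classification, and distance-based loss used in the episodic procedure can be sketched as below. This is a minimal illustration on precomputed toy embeddings: the embedding model is not included, the parameter update is omitted, and all function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the class prototype)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(x, y):
    """Euclidean distance ||x - y||_2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def prototypes(support_embeddings):
    """Map each category to the mean of its support embeddings."""
    return {c: mean_vector(vs) for c, vs in support_embeddings.items()}


def classify(query, protos):
    """Assign the category whose prototype is nearest to the query."""
    return min(protos, key=lambda c: euclidean(query, protos[c]))


def proto_loss(query, true_category, protos):
    """Cross-entropy over a softmax of negative distances to all prototypes."""
    neg = {c: -euclidean(query, p) for c, p in protos.items()}
    m = max(neg.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in neg.values()))
    return -(neg[true_category] - log_norm)


# Toy episode: two categories, two support embeddings each.
support = {"A": [[0.0, 0.0], [0.2, 0.0]], "B": [[1.0, 1.0], [1.0, 0.8]]}
protos = prototypes(support)
label = classify([0.1, 0.1], protos)
loss = proto_loss([0.1, 0.1], "A", protos)
```

In training, this loss would be backpropagated through the embedding model; in the target domain, only `prototypes` and `classify` are needed, since prototypes are generated without further learning.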

IV. EXPERIMENTS AND DISCUSSION
In this section, we present our experimental setting, including baseline methods, datasets we used, pre-processing methods, and evaluation protocols. Experimental results are reported and analyzed as well.

A. DATASETS AND PRE-PROCESSING
We conduct experiments on three micro-expression datasets: SMIC [71], CASME [21], and CASME II [20]. The SMIC dataset has 164 micro-expression videos collected from 16 subjects. It contains three coarse-grained categories of emotion labels: positive, negative, and surprise. The CASME dataset has 196 samples classified into 8 fine-grained classes. The CASME II dataset is larger than CASME, containing 255 samples of 7 categories. We screen out categories with few samples from CASME and CASME II. The remaining categories of the three datasets are listed in Table 1.
As CASME and CASME II have similar categories, as well as settings of dataset construction, we train our model on CASME II, and test it on CASME and SMIC in order to evaluate its performance with different scales of domain-shift.
Besides data augmentation strategies such as random cropping, resizing, and rotation, we expand our training dataset by using neighboring frames of the apex frames in micro-expression videos, following previous studies [36], [47], which assert that neighboring frames have similar facial appearance and emotion expression. Thus, neighboring frames share the same micro-expression categories as apex frames.
Meanwhile, to balance the training samples of different categories, for each apex frame in videos labeled with micro-expression category $i$, we calculate the number of neighboring frames ($aug_i$) to be added to the training dataset from $N_i$ and $N_{min}$, where $N_i$ is the number of samples in category $i$, and $N_{min}$ is the minimum among all $N_i$.

B. IMPLEMENTATION DETAILS

1) COMPARISON METHODS
We implement two state-of-the-art methods: Quang's CapsuleNet [12] for micro-expression recognition, and Liu's micro-expression recognizer [11]. Additionally, several benchmark methods are implemented, including LBP-TOP with uniform code [33] and VGG [6]. LBP-TOP is a hand-crafted micro-expression recognition method. Liu's work extracts features from optical-flow images, while Quang's CapsuleNet and VGG use apex frames as input images.
For a fair comparison, we apply the same data augmentation method to these baselines.

2) PRE-TRAINING
To incorporate knowledge of AUs into the model, we pretrain our AU module before the training stage. We adopt a pre-trained Twin-Cycle Autoencoder [39] as encoder in the AU module, and freeze its parameters. The learning rate of AU detector P in the AU module is set to 0.0012.
For each apex frame in CASME II dataset, there is a group of labels, indicating AUs that appear on the face. We use 10 AUs in CASME II dataset for pre-training. The process is shown in Figure 4.

3) CROSS-DOMAIN FEW-SHOT LEARNING
After the pre-training stage, we conduct cross-domain few-shot learning with fine-tuning and ProtoNet. The parameters of the AU encoder and the AU detector are frozen. We list the learning rates for the other layers in MERAU and the baseline models in Table 2.

C. EVALUATION
We evaluate the performance of MERAU and the baseline methods on CASME under the 4-Way-5-Shot and 2-Way-5-Shot settings. As we adopt a basic classifier and a cosine-distance based classifier for fine-tuning, the variants of our method using these two classifiers are named MERAU and MERAU (CD), respectively.
For the 4-Way-5-Shot setting, we select all four categories remaining in the CASME dataset. For the 2-Way-5-Shot setting, we divide the samples in CASME into two groups: (1) the Easy group and (2) the Hard group. The Easy group contains micro-expression videos labeled with Disgust and Surprise, while the Hard group contains samples labeled with Tense and Repression. Samples in the Hard group are relatively more difficult to differentiate, since both Tense and Repression are negative feelings in coarse-grained classification. We will quantitatively verify this assumption in the next subsection. To avoid overfitting to certain categories of samples, besides accuracy (ACC), the unweighted F1-score (UF1) and the unweighted average recall (UAR) are chosen as performance metrics for MERAU and the baseline models, presented in Table 3. We use TP, TN, FP and FN to denote true positives, true negatives, false positives and false negatives. With N categories of samples, UF1 and UAR are computed from these per-category counts and averaged over the N categories, so that every category contributes equally regardless of its size.
In addition, to evaluate the performance of MERAU under a larger domain-shift, we test our MERAU and all baseline methods on SMIC, which has only 3 coarse-grained categories. The results are shown in Table 4.
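The three metrics named above can be sketched as follows. This is a minimal illustration that assumes the commonly used definitions: UF1 as the unweighted mean of per-category F1 scores, UAR as the unweighted mean of per-category recalls, and ACC as the fraction of correctly classified samples. The function names and the toy labels are hypothetical, and the exact formulas used in the paper may differ in detail.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (TP, FP, FN)} using one-vs-rest counting."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def uf1(y_true, y_pred, labels):
    """Unweighted F1: plain average of the per-category F1 scores."""
    scores = []
    for tp, fp, fn in per_class_counts(y_true, y_pred, labels).values():
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def uar(y_true, y_pred, labels):
    """Unweighted average recall: plain average of the per-category recalls."""
    recalls = []
    for tp, _fp, fn in per_class_counts(y_true, y_pred, labels).values():
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return sum(recalls) / len(recalls)


def acc(y_true, y_pred):
    """Accuracy: fraction of samples whose prediction matches the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy predictions over three categories (made-up data).
y_true = ["D", "D", "S", "S", "T", "T"]
y_pred = ["D", "S", "S", "S", "T", "D"]
labels = ["D", "S", "T"]
print(uf1(y_true, y_pred, labels), uar(y_true, y_pred, labels), acc(y_true, y_pred))
```

Because UF1 and UAR average over categories rather than over samples, a model that ignores a small category is penalized even when its overall accuracy stays high.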
As illustrated, our MERAU outperforms all baseline methods under both a small domain-shift (CASME II to CASME) and a large domain-shift (CASME II to SMIC) on all three metrics. Under the same setting (e.g., 4-Way-5-Shot on CASME), the relative performance of MERAU and the baselines is consistent across the different metrics. Although the micro-expression recognizer has to differentiate among samples of more categories on CASME than on SMIC, the recognition accuracies are significantly higher on CASME. This is because a larger domain-shift increases the difficulty of transferring knowledge learnt from the source domain to a new task on the target domain. The Positive and Negative categories of the SMIC dataset never appear in CASME II, while most of the categories are shared between CASME II and CASME. In addition, the subjects and data collection criteria of SMIC are quite different from those of CASME II. The assumption that the Hard group is more difficult to differentiate is verified by the 2-Way-5-Shot experiments, where all methods perform significantly better on the Easy group.
Furthermore, to quantify domain-shifts, we generate features of samples from the source domain (CASME II) and the two target domains (CASME and SMIC) with a pre-trained encoder. The encoder is a ResNet18 that has no prior knowledge of the three datasets (SMIC, CASME II and CASME), in order to avoid interference. We then compute the Maximum Mean Discrepancy (MMD) [72] between the source domain and each target domain, as follows:
$$\mathrm{MMD}(\mathcal{F}, S, T) = \sup_{f \in \mathcal{F}} \left( \frac{1}{N_s} \sum_{i=1}^{N_s} f(x^s_i) - \frac{1}{N_t} \sum_{j=1}^{N_t} f(x^t_j) \right)$$
Here, $\mathcal{F}$ is the unit ball in a reproducing kernel Hilbert space, and $S$ and $T$ denote the source domain and the target domain. $\{x^s_1, \cdots, x^s_{N_s}\}$ and $\{x^t_1, \cdots, x^t_{N_t}\}$ are the features of samples from the source domain $S$ and the target domain $T$. MMD can effectively represent distances between distributions. As the results show, the MMD between SMIC and CASME II is 1.1588, while that between CASME and CASME II is only 0.0756. This verifies our claim that the domain-shift between CASME II and SMIC is larger than that between CASME II and CASME. Some cross-domain learning studies [73]-[75] constrain MMD or other domain-shift indicators between two domains to minimize the domain-shift, and achieve remarkable domain adaptation results. Incorporating these methods into the training process could further improve the performance of our system.
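The MMD between two sets of feature vectors can be estimated with the kernel form of the empirical statistic, which for the unit ball of a reproducing kernel Hilbert space equals the square root of the mean within-source kernel value plus the mean within-target kernel value minus twice the mean cross-domain kernel value. The sketch below is illustrative only: the Gaussian RBF kernel, the bandwidth parameter `gamma`, the function names, and the toy features are assumptions, since the kernel used for the reported values is not stated in the text.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel (assumed kernel choice; gamma is a free parameter)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd(source, target, kernel=rbf_kernel):
    """Biased empirical MMD between two lists of feature vectors."""
    squared = (
        mean_kernel(source, source, kernel)
        + mean_kernel(target, target, kernel)
        - 2.0 * mean_kernel(source, target, kernel)
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))


# Toy 2-D features: one target close to the source, one far away.
source = [[0.0, 0.0], [1.0, 0.0]]
near_target = [[0.1, 0.0], [1.1, 0.0]]
far_target = [[5.0, 5.0], [6.0, 5.0]]
print(mmd(source, near_target), mmd(source, far_target))
```

A target set whose features lie close to the source features yields a small value, while a distant target set yields a larger one, which is the behavior used above to compare domain-shifts.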
Comparing fine-tuning with ProtoNet, we find that fine-tuning is superior, because metric-based few-shot learning methods cannot fine-tune the parameters of the model on the target domain. In other words, models trained with metric-based few-shot learning methods cannot acquire knowledge from the target domain well.
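The following minimal sketch illustrates why a metric-based method such as ProtoNet does not update model parameters on the target domain: at test time it only averages the support embeddings of each class into a prototype and assigns a query to the nearest prototype. The squared Euclidean distance is the usual choice for prototypical networks; the function names and the toy 2-D embeddings are illustrative assumptions, and the encoder that produces the embeddings is left out.

```python
def class_prototypes(support_embeddings, support_labels):
    """Average the support embeddings of each class into one prototype vector."""
    sums, counts = {}, {}
    for emb, label in zip(support_embeddings, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(emb)
            counts[label] = 0
        sums[label] = [s + e for s, e in zip(sums[label], emb)]
        counts[label] += 1
    return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query_embedding, prototypes):
    """Return the label of the nearest prototype; no parameter is updated here."""
    return min(
        prototypes,
        key=lambda label: squared_euclidean(query_embedding, prototypes[label]),
    )


# Toy embeddings standing in for encoder outputs (made-up values).
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
labels = ["Disgust", "Disgust", "Surprise", "Surprise"]
protos = class_prototypes(support, labels)
print(classify([1.0, 1.0], protos), classify([4.0, 5.0], protos))
```

Since the encoder stays fixed, any mismatch between source and target feature distributions carries over directly into the prototypes.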
In fine-tuning, replacing the basic classifier with the cosine-distance based classifier improves the recognition accuracy of MERAU on the CASME dataset, but lowers its performance on the SMIC dataset. We attribute this adverse effect on SMIC to the scale of the domain-shift, since the cosine-distance based classifier is designed to reduce intra-class differences at the cost of cross-domain adaptability.
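A cosine-distance based classifier can be sketched as below, assuming the common design in which each class owns a weight vector and the logit is a scaled cosine similarity between the feature and that vector. The scale factor of 10, the function names, and the toy weights are illustrative assumptions, not values taken from the paper.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def cosine_logits(feature, class_weights, scale=10.0):
    """Scaled cosine similarity between a feature and each class weight vector."""
    return {
        label: scale * cosine_similarity(feature, w)
        for label, w in class_weights.items()
    }


def softmax(logits):
    """Numerically stable softmax over a {label: logit} dictionary."""
    top = max(logits.values())
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical learned class weight vectors for a 2-way task.
weights = {"Tense": [1.0, 0.0], "Repression": [0.0, 1.0]}
probs = softmax(cosine_logits([3.0, 1.0], weights))
print(max(probs, key=probs.get), probs)
```

Because only the direction of the feature matters, rescaling a feature vector leaves the logits unchanged, which is how this kind of classifier reduces variation within a class.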
According to the confusion matrices shown in Figure 8 and Figure 9, MERAU is better at recognizing familiar classes (Disgust and Surprise), but may confuse samples of the Tense class with those of the Repression class.

D. VISUALIZATION
To show the effectiveness of our proposed model, we give a case study demonstrating the feature spaces learnt by Liu's model and MERAU. After training the two models with $D_{train}$ and fine-tuning them with $D_{support}$, we feed all samples in CASME to both models. The feature embeddings generated by the two models before classification are recorded. To visualize these feature embeddings, we conduct Principal Component Analysis (PCA) for dimension reduction, so that these samples can be presented in the same 2-dimensional space in Figure 5.
To quantify the effectiveness of the feature embeddings generated by MERAU and the baseline Liu's model, we split the visualized samples into the Easy group and the Hard group. For each group of samples, we use an SVM to acquire its best linear boundary, and draw it as the blue line in Figure 6. Furthermore, we compute the classification accuracy of the SVM for each image, and show it in Table 5. As the results show, the feature embeddings generated by MERAU have higher intra-cluster similarity and lower inter-cluster similarity, demonstrating that MERAU distinguishes better among different categories of samples.

E. ABLATION STUDY
To verify the effectiveness of incorporating action units, we remove the AU module from our framework, and feed only optical-flow features into the classifier. Consistent with the former experiments, we pre-train this model on CASME II and fine-tune it on CASME and SMIC. Figure 7 shows the classification results. The UF1 scores improve by 0.0722 on CASME and 0.0471 on SMIC after we incorporate the AU module. When we replace the basic classifier with the cosine-distance based classifier, incorporating knowledge of AUs yields similar improvements.

V. CONCLUSION AND FUTURE WORK
Micro-expression recognition has a wide range of applications (e.g., psychological and clinical diagnosis, emotional analysis, and criminal investigation). However, when a micro-expression recognition system works in cold-start conditions, it has to recognize novel classes of micro-expressions in a new scenario, suffering from the lack of sufficient labeled samples. Meanwhile, inconsistency in micro-expression labeling criteria makes it challenging to use existing labeled datasets in other scenarios.
To tackle these challenges, we present a micro-expression recognizer, which on one hand leverages the knowledge of facial action units (AUs) to enhance facial representation, and on the other hand performs cross-domain few-shot learning to transfer knowledge acquired from labeled samples in datasets available from other scenarios to classify samples in the cold-start scenario. The experimental results show that our recognizer distinguishes better among samples of different micro-expression categories and achieves better recognition accuracies than state-of-the-art methods. On the UF1 metric, our recognizer outperforms the baseline methods by 0.089 on the CASME dataset, and by 0.022 on the SMIC dataset.
For future work, we note that the micro-expression recognition accuracy of our recognizer largely relies on the performance of facial action unit detection. One direction is to incorporate cross-domain learning methods into the pre-training process of the AU detector in our framework, as it has to work in a different domain and predict the probabilities of different AUs' occurrences. In addition, after the cold-start period, more labeled samples will be available, and the micro-expression recognition model has to adapt to the new samples, which is also known as hot update. To avoid repeatedly learning from the old samples or forgetting the knowledge already learnt, continual learning technologies need to be further investigated.
YI DAI (Member, IEEE) is currently pursuing the bachelor's degree with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His research interests include computational psychology, sentiment analysis, and natural language processing.
LING FENG (Senior Member, IEEE) is currently a Professor of computer science and technology with Tsinghua University, China. Her research interests include computational mental healthcare, context-aware data management and services toward ambient intelligence, data mining and warehousing, and distributed object-oriented database systems. VOLUME 9, 2021