Micro-expression spotting: A new benchmark

Micro-expressions (MEs) are brief and involuntary facial expressions that occur when people try to hide their true feelings or conceal their emotions. Psychology research shows that MEs play an important role in understanding genuine emotions, which opens up many potential applications. ME analysis has therefore become an attractive topic for various research areas, such as psychology, law enforcement, and psychotherapy. In computer vision, the study of MEs is divided into two main tasks: spotting and recognition, which identify the positions of MEs in videos and determine the emotion category of detected MEs, respectively. Although much research has been done recently, a fully automatic system for analyzing MEs is still far from practical, for two main reasons: most ME research focuses only on recognition while neglecting the spotting task, and current public datasets for ME spotting are not challenging enough to support the development of robust spotting algorithms. Our contributions in this paper are threefold: (1) We introduce an extension of the SMIC-E database, namely the SMIC-E-Long database, a new challenging benchmark for ME spotting. (2) We suggest a new evaluation protocol that standardizes the comparison of ME spotting techniques. (3) We perform extensive experiments with handcrafted and deep learning-based approaches on the SMIC-E-Long database to provide baseline results.


Introduction
Affective computing is the research field that processes, recognizes, interprets, and simulates human emotions, and it plays an important role in human-machine interaction. Affective computing can involve voices, facial expressions, gestures, and bio-signals [1]. Among these, facial expressions (FEs) are certainly one of the most important channels people use to convey internal emotions. Much research has addressed the FE recognition topic, and several state-of-the-art FE recognition methods report an accuracy of more than 90% [1]. Aside from ordinary FEs, under certain circumstances, emotions can manifest themselves in a special form called "micro-expressions" (MEs) [2,3,4].
MEs are brief and involuntary facial expressions that occur when people try to hide their true feelings or conceal their emotions [5]. Research by Ekman [6] shows that MEs play an important role in psychology by helping to understand hidden emotions. Spontaneous micro-expressions may occur swiftly and involuntarily, and they are difficult to control actively and willingly. This characteristic of MEs enables several applications. For example, when evaluating lectures based on students' emotions, an ME fleeting across the face can reveal a student's hidden emotions. In addition to potential applications in education, ME analysis is also promising in other fields, such as medicine, business, and national security [2]. In business, a salesperson can use MEs to gauge a customer's true response when introducing new products.
Border control agents can detect abnormal behaviors when questioning people. For example, the Transportation Security Administration in the USA has developed the SPOT program, in which airport staff are trained to observe passengers for suspicious behavior by analyzing MEs and conversational signs [5]. In medical fields, especially psychotherapy, doctors can use MEs as a clue to patients' genuine feelings [2]. Therefore, ME analysis plays an essential role in analyzing people's hidden feelings in various contexts.
Unlike regular facial expressions, which we can recognize effortlessly, reading micro-expressions is very difficult for humans, because MEs are too short and fast for human eyes to spot and recognize. Human performance at ME recognition remains poor, even for well-trained specialists. Meanwhile, many studies in computer vision have reported impressive performance on facial expression recognition and facial analysis tasks [7]. Consequently, it is interesting to see how computer science, and especially computer vision, can be utilized for ME analysis.
Research on automatic ME analysis in computer vision is divided into two main problems: recognition and spotting. The former determines the emotional state of MEs, while the latter locates the temporal positions of MEs in video sequences. Currently, most ME research focuses on the recognition task with many novel techniques [3,8], while ME spotting has been studied far less. It is argued that spotting MEs is probably even more challenging than recognizing them. A study in [2] reported indirect evidence of this in human test experiments: automatic ME recognition methods outperformed humans, but accuracy dropped considerably once the ME spotting task was added to the ME analysis system. For real-world applications, ME positions must be determined before any further emotion recognition or interpretation. Therefore, ME spotting is an essential task in developing a fully automatic ME analysis system.
Although the spotting task is receiving increasing attention, several issues remain. First, databases for ME spotting are quite limited.
According to the survey of [3], there are only four spontaneous databases suitable for evaluating spotting methods. Additionally, existing ME spotting studies have been evaluated on very limited databases. The videos are often only 5-10 seconds long and do not contain challenging cases such as eye blinking, head movements, and regular facial expressions, which are easily confused with MEs. To build a real-world ME analysis system, we need to evaluate ME spotting methods on longer videos with more complex facial behaviors. Thus, creating more challenging databases for ME spotting is an important task. Second, it is difficult to make a fair comparison between existing techniques, as they were often evaluated with different evaluation protocols. Moreover, existing protocols only check the location of spotted ME samples without considering the accuracy of the detected ME intervals. Several recognition techniques require the correct location of onset/apex/offset frames [8,3,9]. Most ME recognition studies have been developed and tested on manually segmented ME samples with exact boundaries, so accuracy drops sharply if the input ME samples contain incorrect frames. For example, in the research of Li et al. [2], ME recognition performance is above 65% when evaluated on manually processed ME samples, but drops below 55% when the whole ME analysis system, consisting of both spotting and recognition, is evaluated. Thus, returned ME samples containing many neutral frames or incorrectly located apex frames can lower the performance of the recognition task. Consequently, it is meaningful to standardize the performance evaluation of ME spotting so that experiments are conducted under the same setting and the same metric.
Overall, to overcome the mentioned issues, we make three contributions in this work:
• We introduce a new challenging database for ME spotting, which serves as a new benchmark for the spotting task.
• We suggest a new set of protocols that standardize the evaluation of ME spotting methods. Our protocols aim to make a fair comparison among spotting techniques by considering both the correct ME locations and ME intervals, which are two important factors for the following ME recognition step.
• With the new ME database and the proposed evaluation protocols, we evaluate several recent ME spotting approaches, including traditional approaches and deep learning approaches, and provide baseline results for reference in future ME spotting studies.
The rest of this paper is organized as follows. The second section reviews work related to ME spotting and databases. The third section introduces the new ME spotting database. Next, we describe the spotting methods selected to provide the baseline results. Then, the proposed protocols are introduced in section 5. Finally, the last section reports experimental results and our conclusion.

Related Work
MEs are subtle facial emotions with low-intensity movements and are thus difficult for ordinary people to spot; spotting MEs usually requires well-trained experts. To support the development of automatic ME analysis systems, several public ME databases and ME spotting studies have been proposed in the literature.

Micro-expression database
Although ME analysis has become an attractive research topic in computer vision in the last few years, the number of publicly available spontaneous ME databases is still limited. In the survey of Oh et al. [3], there are only 11 databases for ME research, and most of them were built only for ME recognition.
Additionally, several databases for ME spotting contain only acted MEs, which differ from spontaneous ones, so models built on acted MEs would not work well in real-world applications. In the case of MEs, spontaneous emotions are more genuine than acted ones. Therefore, we only summarize the spontaneous databases that are suitable for ME spotting. In Fig. 1, three examples of ME samples from three spontaneous databases are illustrated.
SMIC is the first database recorded with three different types of camera: high-speed (SMIC-HS), normal visual (SMIC-VIS), and near-infrared (SMIC-NIR). In 2013, Li et al. [10] also extended SMIC by adding non-micro frames before and after the labeled micro frames, creating the extended versions SMIC-HS-E, SMIC-VIS-E, and SMIC-NIR-E. This dataset contains 166 ME samples from 16 subjects.
Following the previous versions CASME and CASME II, CAS(ME)² [11,12,13] was also extended for the ME spotting task. The CAS(ME)² database is divided into two parts: Part A contains both spontaneous macro-expressions and MEs in long videos, and Part B includes cropped expression samples with frames from onset to offset. In total, CAS(ME)² has 194 ME samples captured from 19 subjects.
In the SAMM database [14], the authors added 200 neutral frames before and after the occurrence of the micro-movement, making spotting feasible. SAMM is arguably the most culturally diverse of the current public ME datasets, and it consists of 159 ME samples from 32 subjects.
The recent increasing need for data acquired from unconstrained "in-the-wild" situations has compelled further efforts to provide more naturalistic high-stakes scenarios. The MEVIEW dataset [15] was constructed by collecting poker game videos downloaded from YouTube with a close-up of the player's face. In total, MEVIEW consists of 41 videos of 16 subjects.
Overall, most of the databases are extended from existing ones by adding neutral frames before and after the ME samples to create long videos.
This approach is reasonable because recording new videos takes much effort and requires experts to label the ground truth. However, the extended videos are still short (5-10 seconds per video), so they do not contain many of the other facial behaviors found in real situations. This issue hinders the development of a real-life ME spotting system. Hence, we decided to create a new ME spotting database with longer videos and more challenging facial behaviors.

Micro-expression spotting
When a potential application of ME analysis is implemented in real life, it needs to detect the temporal locations of ME events before any recognition step can be applied. Therefore, ME spotting is an indispensable module of a fully automated ME analysis system. Several studies have addressed this topic; in the scope of this paper, we explore the current trends of ME spotting research in the literature.
In the beginning, most methods tried to detect MEs in videos by computing feature differences between frames. For example, Moilanen et al. [16] spot MEs using the Chi-Square distance of Local Binary Pattern (LBP) features in fixed-length scanning windows. This method provided the baseline results in the first complete ME analysis system combining spotting and recognition [5]. Patel et al. [17] proposed calculating optical flow vectors for small local spatial regions, then using heuristic algorithms to remove non-MEs. Wang et al. [18] suggested a method named Main Directional Maximal Difference, which utilizes the magnitude of the maximal difference in the main direction of optical flow. Recently, the Riesz transform combined with facial maps has been employed to spot MEs automatically [19]. Kai et al. [20] proposed spotting spontaneous MEs in long videos by detecting changes in the ratio of Euclidean distances of facial landmarks in three facial regions. We categorize the mentioned techniques as the unsupervised learning approach.
Nevertheless, tiny motions on the face, such as eye blinks and head movements, are usually very difficult to discriminate from ME samples using thresholding techniques. Hence, some later works proposed machine learning techniques as a more robust tool to distinguish ME samples from normal facial behaviors. We categorize these methods as the supervised learning approach. In [21], the first attempt at utilizing machine learning in ME spotting was introduced: the authors employed AdaBoost to estimate the probability of consecutive frames belonging to an ME.
Then, random walk functions were used to refine and integrate the output from AdaBoost to return the final result. Recently, to provide a benchmark standardizing the evaluation of ME spotting studies, Tran et al. [22] proposed a multi-scale sliding-window based method for detecting MEs.
This method treats ME spotting as a binary classification problem based on a window sliding across positions and scales of a video sequence. Although the studies in [21,22] take advantage of machine learning, their performance is still not good enough, because traditional learning methods are not robust enough to handle the subtle movements of MEs.
Recently, deep learning has become a new trend in many computer vision fields, overtaking traditional approaches, and several studies have started to apply it to the ME spotting problem. Zhang et al. [23] first proposed using a Convolutional Neural Network (CNN) to detect the apex frame. Although different techniques have been introduced for ME spotting and some achievements have been obtained, application to a real ME analysis system is still far away. One of the reasons is that the current ME databases used to develop spotting methods are still not challenging enough compared to practical situations: most videos in current ME spotting databases are only 5-10 seconds long.
For real applications, we need to evaluate ME spotters on longer and more challenging videos.

[Figure: (Top) ME sample of the "anger" emotion from subject 6. (Bottom) "Positive" ME sample from subject 2 of the SMIC-HS dataset.]

Evaluation protocols
With the significant growth of ME spotting research, various evaluation metrics for ME spotting methods have been proposed in the literature. According to the survey of Oh et al. [3], the evaluation of ME spotting can be divided into two main approaches: ME-sample spotting based evaluation and apex-frame spotting based evaluation.
First, in the research of Li et al. [5], the authors propose considering a spotted apex frame inside the interval [onset − (N−1)/4, offset + (N−1)/4] as a true positive, where N is the length of the detected window. They then plot the Receiver Operating Characteristic (ROC) curve to compare spotting performance. Extending from the ROC curve, Duque et al. [19] added the Area Under the Curve (AUC) as a standard evaluation metric.
In the next study [21], the authors still utilized the ROC curve to test their method. However, the difference is that true positive samples are determined by the Intersection over Union (IoU) between the detected sample and the ground truth. This criterion is formulated as:

(X_W ∩ X_G) / (X_W ∪ X_G) ≥ ε,    (1)

where X_W and X_G are the spotted interval and the ME ground truth, respectively, and ε is set to 0.5.
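As a concrete illustration, the IoU criterion of Eq. (1) on frame intervals can be sketched as follows (intervals are given as inclusive (start, end) frame indices; this is our illustrative reading, not code from [21]):

```python
def interval_iou(spotted, gt):
    """IoU of two closed frame intervals given as (start, end)."""
    inter = max(0, min(spotted[1], gt[1]) - max(spotted[0], gt[0]) + 1)
    union = (spotted[1] - spotted[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union

def is_true_positive(spotted, gt, eps=0.5):
    """Eq. (1): the spotted interval is a true positive if IoU >= eps."""
    return interval_iou(spotted, gt) >= eps
```

For instance, a spotted interval (2, 11) against a ground truth (0, 9) overlaps on 8 frames out of a 12-frame union, giving an IoU of about 0.67, hence a true positive at ε = 0.5.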
In 2017, Tran et al. [22] proposed the first version of an ME spotting benchmark to standardize the performance evaluation of the ME detection task. Based on a protocol similar to that of object detection, they utilized the same true positive criterion as Eq. (1) and plotted the Detection Error Tradeoff (DET) curve to compare spotters. Besides, other works apply interval-based evaluation as in Li et al. [5] but with other performance metrics, such as the F1-score [24,25].
Different from the above studies, instead of detecting ME intervals, several works spot the apex frame location [26,27]. To evaluate these methods, the Mean Absolute Error (MAE) is used to compute how close the estimated apex frames are to the ground-truth apex frames [26,27,3].
For spotting in long videos, Liong et al. [28] introduced another measure called the Apex Spotting Rate (ASR), which calculates the success rate of spotting apex frames within a given onset-offset range in a long video. An apex frame is scored 1 if it is located between the onset and offset frames, and 0 otherwise. In another approach, Nistor et al. [29] proposed a combination of metrics, i.e., intersection-over-union percentage, apex percentage, and intersection percentage, to determine the number of correctly detected apex frames.
Since various evaluation protocols exist for ME spotting, comparisons of performance among existing methods are inconsistent. Moreover, the existing protocols still leave several issues uncovered. For example, if two methods have the same number of true positive detections, there is no metric to determine which method is better in terms of ME interval coverage. Therefore, it is necessary to design a standard evaluation protocol that enables a fair comparison of ME spotting methods.

SMIC-E-Long database
In this section, we describe the construction of a new dataset for ME spotting. We see an urgent need to explore the performance of the ME spotting task in a realistic environment. Thus, creating a new challenging database for ME spotting, with longer videos and more complex facial behaviors, is a reasonable way to assess the performance of existing ME spotting techniques.

Data Acquisition
Constructing a completely new ME database costs much effort and requires experts to label the ME ground truth. The SMIC-E and CAS(ME)² databases [10,13] were therefore extended to enable ME spotting by adding neutral frames before and after ME samples. However, the number of frames added in these datasets is still limited. During the construction of SMIC [30,10], a huge number of neutral frames captured by the high-speed camera remained from the recording step. By adding these remaining frames from the original recordings, the new dataset gains many challenging cases, such as head movements, eye blinking, and regular facial expressions, which increase potential false detections.
Thus, we decided to extend SMIC-E to create a more challenging spotting dataset, namely SMIC-E-Long.
First, we add 2000 to 3000 neutral frames (approximately 20 seconds) before and after each ME sample to create longer videos (approximately 22 seconds per video). During this process, several ME samples from the original SMIC database are merged into a single long video containing multiple ME samples. We also select a few clips (around 20 seconds per video) without ME samples but containing regular facial expressions that might appear in real ME spotting situations.
Finally, our new dataset comprises 162 long videos. Comparing with existing ME spotting datasets (Table 1), our dataset has 162 long videos, while CAS(ME)² and SAMM have 32 and 79 videos, respectively. Additionally, the average duration of a video in our dataset is much longer than in the existing datasets (22 seconds compared to 5 and 10 seconds). The proposed dataset also contains more challenging cases than the existing datasets. As presented in Fig. 2, we introduce other facial movements in our dataset; two examples are eye blinking and regular facial expressions, which can easily be confused with MEs. As illustrated, the eye blinking takes only 12 frames, so it is similar in character to ME samples. These features make the proposed dataset more challenging than existing ones for the ME spotting task.

Face Preprocessing
At the beginning of a facial analysis system, a preprocessing step is needed to align face images across video frames. This step is necessary to reduce differences in face shape and changes caused by large head rotations across the video.
Generic face preprocessing includes four steps: (1) face detection, (2) facial landmark detection, (3) face registration, and (4) face cropping. In existing studies, methods have utilized various face alignment steps, which leads to differences in the final face size and affects the performance of the later spotting and recognition steps. For example, several methods utilized Discriminative Response Map Fitting (DRMF) [27,31], while other studies select the Active Shape Model [32,21] to extract the landmark points.
This issue can cause an unfair comparison of different techniques. Therefore, we carry out face alignment for our dataset and provide the preprocessed face set, which can be used as a standard input for a fair comparison between methods. We first locate three landmark points in the first frame. Face registration is then applied using Local Weighted Mean [33], and the face area is cropped using the designed template. Fig. 3 illustrates the designed template for face alignment; it includes the three landmark points and the final face size. After detecting the landmark points in one frame, we reuse these landmark locations for the next M = 30 consecutive frames, because landmark locations change very little over a short duration. For frames with large head movements, we reduce the value of M and conduct the face alignment of the processed video again to ensure the quality of the alignment.
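A minimal sketch of landmark-based registration. Note that we use Local Weighted Mean [33] in the actual pipeline; as a simpler stand-in here, we estimate the affine transform that is exactly determined by the three landmark correspondences:

```python
import numpy as np

def affine_from_landmarks(src_pts, dst_pts):
    """Solve the 2x3 affine transform mapping three source landmarks
    onto three template landmarks (exactly determined by 3 points)."""
    src = np.asarray(src_pts, dtype=float)   # shape (3, 2)
    dst = np.asarray(dst_pts, dtype=float)   # shape (3, 2)
    A = np.hstack([src, np.ones((3, 1))])    # (3, 3) rows: [x, y, 1]
    # Least squares is exact here since the system is square and full rank.
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return M.T                               # (2, 3) affine matrix

def warp_points(M, pts):
    """Apply the 2x3 affine matrix to an (n, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]
```

In practice, the warped frame would be resampled with this transform and then cropped to the template's face size.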

Spotting Methods
In this section, we select various ME spotting methods from unsupervised and supervised learning approaches to provide the baseline results.

Unsupervised learning methods
First, we implement the method of Li et al. [2], which provided the first baseline results for ME spotting. In this method, a scanning window of size L is slid across the video sequence. At each position, we extract LBP features from a 6 × 6 spatial block division at the head (first) frame (HF), tail frame (TF), and center frame (CF) of the scanning window. The Chi-Square distance between the CF features and the average of the TF and HF features is then computed. Finally, thresholding is applied to return the location of the apex frame of a detected ME. Details of this method are described in [2].
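The scanning-window feature-difference scheme can be sketched as follows. For brevity we substitute per-block intensity histograms for real LBP histograms; the block grid, bin count, and window length L are illustrative, not the tuned settings of [2]:

```python
import numpy as np

def block_histograms(frame, grid=6, bins=16):
    """Per-block intensity histograms (stand-in for LBP histograms)."""
    h, w = frame.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = frame[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            hist, _ = np.histogram(block, bins=bins, range=(0, 256))
            feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)

def chi_square(p, q, eps=1e-10):
    """Chi-Square distance between two normalized histograms."""
    return np.sum((p - q) ** 2 / (p + q + eps))

def spot_scores(frames, L=9):
    """Feature-difference score per window position: distance between the
    center frame (CF) and the average of head (HF) and tail (TF) frames."""
    scores = []
    for t in range(len(frames) - L + 1):
        hf = block_histograms(frames[t])
        cf = block_histograms(frames[t + L // 2])
        tf = block_histograms(frames[t + L - 1])
        scores.append(chi_square(cf, (hf + tf) / 2))
    return scores
```

A peak in the score sequence that exceeds a threshold marks a candidate apex location.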

Main Directional Maximal Difference Analysis for spotting
The Main Directional Maximal Difference Analysis (MDMD) method was proposed by Wang et al. [18,34]. It takes a similar approach to the LBP-χ²-distance method [2], using a scanning window and block-based division. MDMD uses the magnitude of the maximal difference in the main direction of optical flow as the feature for spotting MEs.

Landmarks-based method
We re-implement the method from [20] to explore the performance of spotting ME samples in long videos using geometric features based on landmark points. In this method, specific landmark points in the eyebrow and mouth areas are used to calculate Euclidean distance ratios. A sliding window is then scanned across the video frames to compute the change between the currently processed frame and a reference frame, and a threshold decides which frames in the scanning window are MEs. Details of this method are described in [20].
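A hedged sketch of the distance-ratio idea (the landmark pairs, reference-frame choice, and threshold are illustrative, not the exact configuration of [20]):

```python
import numpy as np

def distance_ratio(landmarks, pair_a, pair_b):
    """Ratio of two Euclidean distances between landmark pairs
    (the specific pairs/regions here are illustrative)."""
    d_a = np.linalg.norm(landmarks[pair_a[0]] - landmarks[pair_a[1]])
    d_b = np.linalg.norm(landmarks[pair_b[0]] - landmarks[pair_b[1]])
    return d_a / max(d_b, 1e-10)

def spot_by_ratio_change(ratios, ref_idx=0, thresh=0.1):
    """Flag frames whose ratio deviates from the reference frame's
    ratio by more than a threshold."""
    ref = ratios[ref_idx]
    return [abs(r - ref) > thresh for r in ratios]
```

The per-frame ratios are computed inside each sliding window, and flagged runs of frames become candidate ME samples.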

Spatial-temporal feature Method
We utilize the method from Tran et al. [22]. First, ME samples are classified at each frame of the video sequence; then non-maximal suppression is applied to merge multiple detected samples. Spatial-temporal features, which are widely applied in ME and facial expression analysis, are extracted at each frame for the classification:
• Local Binary Pattern from Three Orthogonal Planes (LBP-TOP). This feature was introduced in the work of Zhao et al. [23] as an extension of LBP for analyzing dynamic texture. It has been widely utilized as a feature for facial expression analysis.
We utilize an SVM to distinguish micro and non-micro samples. The details of this method can be found in [22].
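The merging of overlapping per-frame detections can be sketched as a greedy 1-D non-maximal suppression over scored intervals (an assumed standard formulation, not the exact procedure of [22]):

```python
def nms_1d(detections, iou_thresh=0.5):
    """Greedy 1-D non-maximal suppression over scored frame intervals.
    detections: list of (start, end, score) with inclusive endpoints."""
    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
        union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
        return inter / union
    kept = []
    # Visit detections from highest to lowest score; keep a detection
    # only if it does not overlap too much with anything already kept.
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(iou(det, k) < iou_thresh for k in kept):
            kept.append(det)
    return kept
```

For example, two heavily overlapping windows collapse to the higher-scoring one, while a distant detection survives.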

CNN-based method
Among deep learning structures, CNN has become popular since it offers good performance in many studies in the computer vision field. For the ME spotting, we also apply CNN as a baseline for our dataset. The proposed method is developed based on the research in [23]. As illustrated in Fig. 4, this method consists of two main components.
The first step is a CNN model that distinguishes apex and non-apex frames.
We consider all frames from onset to offset as apex frames to increase the number of apex frames for training. Different from the original work [23], we select two state-of-the-art CNN architectures for training the model, VGG16 [35] and ResNet50 [36], because they are considered state-of-the-art image classification methods.
The later step is feature engineering, which is used to merge the nearby detected frames into the final ME intervals.

LSTM-based method
Another baseline method for ME spotting is the sequence-based learning approach proposed by Tran et al. [24]. This method includes two main steps: feature extraction and apex frame detection based on a deep sequence model. To construct the sequence of spatial-temporal feature vectors, we slide a scanning window across all temporal positions of a video. At each position, e.g., frame i, we extract features from the sequence of frames i to (i + L). With this strategy, we consider each position an ME candidate.
In [22], the authors proposed a continuous ground-truth score for each sample. However, we ignore these scores and consider each sample to be either ME or non-ME: if a position is considered an ME sample, the ground truth is set to 1, otherwise 0. For the spatial-temporal features, we utilize two kinds of features: HIGO-TOP and HOG-TOP [2].
For the apex frame detection, we employ the Long short-term memory (LSTM) network, which is a special kind of Recurrent Neural Network (RNN). It was introduced by Hochreiter [37], and was refined and popularized in many research works, especially in sequence learning. The power of the LSTM network in learning and modeling sequence data is useful for estimating ME's position in the video sequence.
The idea of using an LSTM for ME spotting is to slide a scanning window across the video sequence. Each scanning window contains M spatial-temporal features corresponding to M temporal positions. The constructed network takes the M spatial-temporal features as input and predicts M values representing the ME score at each temporal position. The position i* with the highest score inside a scanning window, provided the score exceeds a threshold, is considered an ME sample, and the specific apex frame is determined by the location of that maximum score. When dealing with ME spotting in long videos, most methods utilize a scanning window to spot ME samples. Therefore, after predicting the ME score at each frame, we merge nearby detected samples using the same feature-engineering strategy as the CNN-based spotting method.
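The construction of scanning-window input sequences with binary per-position labels can be sketched as follows (the feature dimensionality, window size M, and the non-overlapping stride are illustrative simplifications of the sliding strategy described above):

```python
import numpy as np

def make_sequences(frame_feats, me_intervals, M=32):
    """Build scanning-window sequences of per-frame features and binary
    per-position labels (1 inside an ME interval, else 0).
    frame_feats: list of 1-D feature vectors, one per frame.
    me_intervals: list of (onset, offset) inclusive frame indices."""
    T = len(frame_feats)
    labels = np.zeros(T, dtype=int)
    for onset, offset in me_intervals:
        labels[onset:offset + 1] = 1
    seqs, seq_labels = [], []
    for t in range(0, T - M + 1, M):      # non-overlapping stride for brevity
        seqs.append(np.stack(frame_feats[t:t + M]))
        seq_labels.append(labels[t:t + M])
    return np.stack(seqs), np.stack(seq_labels)
```

Each (M, feature_dim) sequence and its (M,) label vector would then be fed to the LSTM for training.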

Proposed evaluation protocol
As mentioned in the related work section, various performance evaluation protocols have been proposed in existing research, and it is difficult to say which one is better for ME spotting. To standardize the performance comparison, we recommend a new set of protocols that is more generalized and suitable for the evaluation of ME spotting. Our proposed protocols are described in the next sub-sections: (1) the experimental setup for splitting the data into training and testing sets, and (2) the metric for evaluating the performance of spotting techniques.

Training and testing set
To split the data into training and testing sets, we select leave-one-subject-out (LOSO) cross-validation. LOSO is a common setup that is widely utilized in ME analysis and facial expression recognition [2,24,38]. In this setup, the videos and samples belonging to one subject (participant) are held out for testing, while the remaining samples are used for training.
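A minimal sketch of generating LOSO folds from per-sample subject ids:

```python
def loso_splits(sample_subjects):
    """Leave-one-subject-out: yield (test_subject, train_idx, test_idx)
    given a list holding the subject id of each sample."""
    for subj in sorted(set(sample_subjects)):
        test_idx = [i for i, s in enumerate(sample_subjects) if s == subj]
        train_idx = [i for i, s in enumerate(sample_subjects) if s != subj]
        yield subj, train_idx, test_idx
```

Each fold trains on all subjects but one and tests on the held-out subject, so results reflect generalization to unseen people.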

Performance Evaluation
To standardize the performance evaluation of ME spotting methods, we propose two new ME movement spotting-based evaluation metrics: sample-based and frame-based. Evaluation based on single apex frame detection, which is employed in several previously proposed protocols, is not considered in our protocol. We argue that a single detected apex frame is not sufficient for the following ME recognition step, since most ME recognition methods require the correct extraction of the ME interval. Therefore, our metrics also assess how correctly the ME interval is detected.

Sample-based evaluation
In this section, the sample-based evaluation metric for detected ME samples is introduced. First, we present how to decide whether a spotted interval is a true positive or a false positive. In many ME spotting methods, the Intersection over Union (IoU) is a common approach that treats ME spotting as an object detection problem: the overlap between ME samples and spotted intervals is considered, similar to comparing a detected bounding box with the ground truth in object detection [39]. The decision condition is shown in Eq. (1). The value of ε is set to 0.5 in most studies; it is arbitrary but reasonable. Then, we utilize the same evaluation metric as in the study of Li et al. [32] to compare the spotting methods.
To provide a more informative comparison, we also present the results of several methods following the DET curve evaluation protocol used in the research of Tran et al. [22]. Following Tran et al. [22], we create the Detection Error Tradeoff (DET) curve, which plots the miss rate against the number of false positives per video:

miss rate = FN / N+,    (2)
FPPV = FP / V,    (3)

where N+ is the number of ME samples in our dataset and V = 162 is the number of long videos in our database.
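Assuming the DET operating point plots the miss rate (FN/N+) against false positives per video (FP/V), one point can be computed as follows (our reading of the protocol, not code from [22]):

```python
def det_point(tp, fp, n_pos, n_videos):
    """One operating point of the DET curve.
    tp, fp: true/false positive counts at some detection threshold.
    n_pos: total number of ME samples (N+); n_videos: number of videos (V)."""
    miss_rate = (n_pos - tp) / n_pos   # FN / N+
    fppv = fp / n_videos               # FP / V
    return miss_rate, fppv
```

Sweeping the detection threshold and recomputing (miss rate, FPPV) traces out the full DET curve for a spotter.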
Nevertheless, the use of IoU has a drawback caused by the diverse lengths of ME samples in our database, which range from 10 to 51 frames.
Several ME samples can therefore be missed by a fixed-length detection window, which is commonly used in previous ME spotting approaches [21,24]. For example, if the detection window length is set to 35, ME samples of length 11 to 17 will be missed by the IoU-based evaluation with threshold 0.5. We therefore define a new evaluation metric for detected samples as follows. First, we define three outcomes: "Hit" (TP), "Miss" (FN), and "False" (FP). A "Hit" is counted when an ME ground truth with center frame C_gt satisfies the condition |C_w − C_gt| ≤ 0.5 * L_gt with the nearest detected window, where C_w and C_gt are the center locations of the detected window and the ground truth, respectively, and L_gt is the length of the ground-truth sample. Otherwise, the un-spotted ground truth is counted as one "Miss". A "False" is counted when a detected window's nearest ground truth satisfies |C_w − C_gt| > 0.5 * L_gt.
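The Hit/Miss/False counting above can be sketched as follows (detections and ground truths are given as (center, length) pairs; the nearest-neighbor matching is our straightforward reading of the rule):

```python
def count_hits(detections, ground_truths):
    """Center-distance criterion: a ground truth (c_gt, l_gt) is a Hit if
    its nearest detected window center c_w satisfies |c_w - c_gt| <= 0.5*l_gt;
    a detection whose nearest ground truth violates this is a False alarm."""
    hits = misses = 0
    for c_gt, l_gt in ground_truths:
        nearest = min((abs(c_w - c_gt) for c_w, _ in detections), default=None)
        if nearest is not None and nearest <= 0.5 * l_gt:
            hits += 1
        else:
            misses += 1
    falses = 0
    for c_w, _ in detections:
        nearest = min(ground_truths, key=lambda g: abs(c_w - g[0]), default=None)
        if nearest is None or abs(c_w - nearest[0]) > 0.5 * nearest[1]:
            falses += 1
    return hits, misses, falses
```

Note that, unlike IoU, this criterion depends only on the ground-truth length, so fixed-length detection windows no longer penalize short ME samples.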

Frame-based evaluation
In the previous sub-section, we described the performance evaluation based on detected ME samples. Sample-based evaluation is a common approach for evaluating ME spotting methods and is quite similar to the evaluation of object detection. However, ME spotting differs from object detection. If we consider building an automatic system to detect and recognize MEs in long videos, the quality of the detected samples should also be analyzed. A poorly detected ME sample could contain too many neutral frames or miss ME frames needed for the subsequent recognition task. Several recognition studies require spotted ME samples to be detected correctly (from onset to apex to offset) [5,8,40,41]. Therefore, missing ME frames or including redundant non-micro frames in the spotting results can heavily affect recognition performance.
To address this issue, it is meaningful to consider a frame-based evaluation, which focuses on the correctness of the spotted frames in each detected true positive sample. We complement the protocol with a metric called frame-based accuracy F, defined as the mean deviation of the lengths and centers of all correctly detected (TP) samples from their ground truth. The calculation of F is defined as Eq. 4:

F = (1/n) Σ_{i=1}^{n} ( |C_{w_i} − C_{gt_i}| + |L_{w_i} − L_{gt_i}| ),   (4)

where C_{w_i} and L_{w_i} are the center frame and window size of the i-th true positive sample, and C_{gt_i} and L_{gt_i} are the center frame and length of the ME ground truth corresponding to the i-th true positive sample, respectively.
Here n is the number of true positive samples. Fig. 8 illustrates the effectiveness of the F metric: from the figure, we can conclude that the top spotter is better than the bottom one. A smaller F value means more accurately detected ME locations. In the extreme case where all TP samples perfectly match the ground truth, as for the top spotter in Fig. 8, F is 0.
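A minimal sketch of this computation is shown below, assuming the plausible instantiation of Eq. 4 as the mean of the absolute center and length deviations over all TP samples:

```python
def frame_based_accuracy(tp_pairs):
    """Frame-based accuracy F (sketch of Eq. 4): mean deviation of the
    centers and lengths of all true-positive windows from their matched
    ground truths. tp_pairs: list of ((c_w, l_w), (c_gt, l_gt)) tuples,
    where c_* is a center frame index and l_* a length in frames."""
    if not tp_pairs:
        return 0.0
    total = sum(abs(cw - cg) + abs(lw - lg)
                for (cw, lw), (cg, lg) in tp_pairs)
    return total / len(tp_pairs)
```

With this definition, a perfect match of every TP sample yields F = 0, matching the extreme case described for the top spotter in Fig. 8.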

Implementation
We describe here the implementation and training details of our selected baseline methods. The required development tools and the selection of important parameters will also be discussed.
First of all, to determine the value of L, the length of detected ME samples, we calculate the mean ME sample size. Since the mean value is 34, we set the detection window size to L = 35. For ME spotting methods that return only the location of the apex frame, such as the MDMD method, the ⌊L/2⌋ frames before and after the spotted apex frame are considered part of the ME sample.
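The apex-to-window expansion can be sketched as follows (the clipping to video bounds is an implementation detail assumed here):

```python
def apex_to_window(apex, video_len, L=35):
    """Expand a spotted apex frame into a fixed-length ME window:
    floor(L/2) frames before and after the apex, clipped to the video."""
    half = L // 2
    first = max(0, apex - half)
    last = min(video_len - 1, apex + half)
    return first, last
```

With L = 35, an apex at frame 100 yields the 35-frame window (83, 117).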
Next, we explain the implementation of each method. Firstly, we consider the unsupervised learning methods. In the LBP − X2-distance method [2], the authors did not carry out the spotting on aligned faces; however, we conduct our experiment on the pre-processed faces. For extracting the LBP feature, we utilize scikit-image with the default setup [42]. The landmark-based method is re-implemented using the DLIB toolbox to extract 68 landmark points and compute the ratios between specific points. The window size for processing landmark points is also set to L.
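For reference, the basic 8-neighbour LBP operator (the kind scikit-image computes with its default settings) can be sketched from scratch as below; the neighbour ordering here is illustrative and may differ from scikit-image's internal convention:

```python
def lbp_code(img, y, x):
    """Basic 8-neighbour LBP code at pixel (y, x) of a 2-D grey image
    (list of lists): each neighbour >= the centre contributes one bit."""
    c = img[y][x]
    # Neighbours in a fixed clockwise order starting at the top-left.
    nbrs = [img[y - 1][x - 1], img[y - 1][x], img[y - 1][x + 1],
            img[y][x + 1], img[y + 1][x + 1], img[y + 1][x],
            img[y + 1][x - 1], img[y][x - 1]]
    return sum(1 << i for i, v in enumerate(nbrs) if v >= c)
```

In practice one histograms these codes per face block; the χ2 distance between consecutive histograms then yields the spotting signal of the LBP − X2 method.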
For the spatial-temporal feature methods, we utilize the MATLAB tool provided by the authors to extract the three features LBP − T OP , HIGO − T OP , and HOG − T OP [39]. Following the research of Tran et al. [22], we select only one set of parameters: block division 8×8×4, overlap 0.2, and 8 bins. The length of the image sequence for spatial-temporal feature extraction is set to L, and we employ the scikit-learn toolbox to train the model. In this method, we also utilize the Temporal Interpolation Model (TIM) to scale the long videos by factors 1, 0.75, and 0.5. In this way, we aim to return adaptive-size detected windows.
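A window spotted in a TIM-rescaled sequence must be mapped back to the original timeline; a minimal sketch of that mapping (the rounding convention is an assumption) is:

```python
def window_in_original(start_scaled, L, scale):
    """Map a fixed-length window spotted in a TIM-rescaled sequence back
    to the original timeline. With scale factors 1, 0.75 and 0.5, an
    L-frame window covers roughly L, L/0.75 and 2L original frames."""
    first = int(round(start_scaled / scale))
    length = int(round(L / scale))
    return first, first + length - 1
```

This is how a fixed detection length L = 35 at three scales yields adaptive effective window sizes (35, 47, and 70 frames) in the original video.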

Results
In this section, we present the baseline results based on our proposed evaluation protocols. We also conduct experiments on two public datasets, CAS(ME) 2 and SMIC-VIS-E, to highlight the challenges of the newly introduced dataset. The leave-one-subject-out (LOSO) protocol is used to split the dataset into training and testing sets, as discussed in Sections 5.1 and 3. The two groups of experimental results are shown in Tables 2, 3, and 4. Details of each set of experiments are described in the following sub-sections.
Comparison on existing protocol

In this set of experiments, we evaluate the baseline methods with the existing evaluation metrics: Precision, Recall, and F1-score [32]. The results are displayed in Table 2 and Table 3, and the DET-curve performances are displayed in Fig. 9. To demonstrate the challenge of our dataset, we also provide the experimental results on the SMIC-VIS and CAS(ME) 2 datasets. As presented in Table 3 and Table 2, the number of FP samples is very high. This is because all of the baseline methods have difficulty in discriminating ME samples from the other facial movements that appear more frequently in our new dataset.

Figure 9: Performance of detectors based on interval-domain (lower is better).

Comparison on proposed protocol
The experimental results of the selected baseline methods with the proposed protocol are shown in Table 4. Although the SVM-based methods spot ME samples with an adaptive length of detected windows, they are limited to three scale factors, while our dataset contains ME samples whose sizes range from 11 to 56. This causes the low "F" values of the SVM-based methods compared to LBP − X2. In a future study, onset-apex-offset detection should be considered to improve the "F" value.

Discussion
In this sub-section, we analyze the effectiveness of our selected spotting methods. Our analysis is split into four parts, as detailed below.
Unsupervised methods and supervised methods. We first explore the effectiveness of the unsupervised and supervised approaches. Based on the F1-score results in Table 2 and Table 4, we observe similar performance between the two approaches: the first place goes to HIGO − T OP − LST M (a supervised method) and the second place to LBP − X2 (an unsupervised method). However, the next places all belong to supervised methods. When considering the FP values, most of the supervised methods are better than the unsupervised ones. This can be explained by the fact that the supervised methods are better equipped to distinguish ME samples from non-ME samples.
The LBP − X2 method requires no training yet surpasses several machine learning techniques to rank in second place. Handcrafted features and deep features. Next, we explore the performance of the supervised methods, which can be further split into two approaches: handcrafted feature learning and deep feature learning. In Table 4, an interesting observation is that the use of the LSTM is not always better than the traditional machine learning techniques in terms of F1-score. Therefore, the correlation between spatial-temporal features and learning models is an interesting topic to explore in future research. Additionally, the number of ME samples for training is still insufficient, while deep learning techniques often require a huge amount of training data.
Another deep learning technique in our baseline is the CNN-based method, which has lower performance since it is hard to discriminate between micro frames and neutral frames using only a single frame. As mentioned in the specification of the CNN-based ME spotting method, the frames from onset to offset of ME samples are treated as apex frames to train the image classification model. However, when observed in a single frame, several extrinsic facial movements show facial action units similar to those of ME samples. This confuses the image classification model when discriminating ME frames from frames containing other facial movements. Hence, CNN-based methods can detect the frames of other facial movements as ME apex frames. In general, we can conclude that the use of deep sequence learning for ME spotting is promising, but further improvements are needed, such as increasing the number of ME samples.
Appearance features and geometric features. In our experiments, most methods try to spot ME samples based on appearance features: spatial-temporal features (LSTM-based and SVM-based methods), optical flow (MDMD), texture analysis, and raw images. Only one method uses geometric features (facial landmark points), and it yields lower results compared to the appearance-based approaches. The problem with the facial landmark method is that it is built on a rather simple idea: calculating the ratios of the distances between landmark points.
Therefore, this method cannot detect ME intervals with very small movements. In particular, several ME cases occur only in the eye regions, which makes them similar to eye-blinking actions. These cases cause missed TP samples or trigger false detections.
Frame-based performance. Additionally, most of the spotting methods use a fixed-length detection window to return the ME intervals, which can increase the number of neutral frames or missed ME frames in the interval. In our experiments, the SVM-based methods carried out multi-scale detection to return adaptive-size windows. However, the performance of these methods is still low, and their "F" values do not outperform those of other methods with similar TP values. To address this issue, future research needs to consider spotting techniques that output an adaptive detection window or that focus on spotting the onset and offset locations.
Through the results in Table 4, it can be seen that our new dataset is very challenging, as reflected by the poor F1-scores and "F" values of the baseline methods. This shows that the spotting task in long videos remains a very challenging problem for future research.

Conclusion
In this paper, we have introduced a new challenging dataset for ME spotting. We have also suggested a new set of evaluation protocols to standardize and enable a fairer comparison among ME spotting techniques.
Finally, we explored various spotting algorithms from handcrafted to deep learning methods to provide the baseline for future comparison.
Following the experimental results, there are still several issues to improve in future work. First of all, the poor performance of the deep learning techniques can be improved by applying data augmentation to increase the number of ME samples. Better deep sequence models can also be employed to analyze the spatial-temporal information. Cross-database evaluation needs to be conducted to explore the generalization capability of ME spotting techniques. Furthermore, onset-offset detection should be included in the next study to locate ME samples correctly. Finally, ME recognition will be considered to create a complete benchmark for the whole automatic ME analysis system.