A Systematic Review of Facial Expression Detection Methods

Understanding emotions is one of the most important human capabilities: reading facial expressions provides important information about other individuals and supports the perception of their mental or emotional states. Advances in Artificial Intelligence and Visual Computing, more specifically in Deep Learning with the advent of Artificial Neural Networks, have enhanced the ability of machines to infer human emotions through image analysis. This paper presents a Systematic Literature Review (SLR) with the purpose of researching, mapping, and summarizing studies that address the techniques and algorithms most used for facial expression recognition. The convolutional neural network models analyzed in this review are based on deep learning, with an emphasis on expression and microexpression recognition. The results suggest that laboratory-controlled image databases, combined with CNNs such as VGG and ResNet, yield excellent performance in the reported tests. For better understanding, all the methods identified in the review are detailed and compared.


I. INTRODUCTION
Human facial expression is closely linked to the emotional characteristics of individuals and plays a crucial role in everyday communication [1]. Ever since the concept of Affective Computing (AC) was created by Picard [2], in 1997, it has been gaining prominence and evolving in the academic environment, providing computers with the ability to detect human emotions. In this way, emotion recognition from the analysis of facial expressions allows machines to understand human emotions, representing extremely important application perspectives, as indicated in several research works [3], [4], [5], and [6].
There is a large body of research in the areas of Affective Computing (AC) and Human-Computer Interaction (HCI), applied both to Computer Vision [7], [8], [9], [10], [11] and to other areas of knowledge, such as Education, in support of Intelligent Learning [12]; Health, with support for Neurophysiological Monitoring [13]; and Sentiment Analysis for Natural Language Processing (NLP) [14]. The purpose of such research is to develop cognitive, intelligent, and reliable emotion detection systems that distinguish and understand people's affect and thus provide sensitive and timely responses to users in a given application area. These systems are generally developed with Artificial Intelligence techniques and models, which are already employed in various services in society today, such as the measurement of Facial Affect [15]; in the Legal area, in the analysis of the Affective Linguistic skills of lawyers [16]; in Education [17]; and in Health, with the analysis of Psychobehavioral Behaviors [18].
Although there are many works, like the ones mentioned above, that make use of techniques from the field of AC, there are few studies that map the most efficient algorithms and techniques for facial expression detection. Thus, this work develops a Systematic Literature Review (SLR) with the purpose of researching, mapping, and summarizing studies that address the techniques and algorithms most used for facial expression recognition based on Artificial Intelligence.
The present research was motivated by the desire to contribute to the state of the art in SLR research on facial expression detection methods; moreover, it contributes to the systematization of studies in this area through the use of a free platform available to all researchers. Concatenating the various studies present in the literature is a demanding and challenging task, but the result is a valuable contribution to those who are developing research on this topic.
The Systematic Review and its results are divided into the following sections: the formulation of the problem and the motivation for this SLR are described in Section II; the methodology applied for the development of the research is discussed in Section III; the results obtained with the applied methodology are addressed in Section IV; Section V presents the discussion of the models, algorithms, and databases used in the filtered research; and, finally, Section VI presents the final considerations of the article and directions for future work.

II. PROBLEM FORMULATION
The great advances in the field of Artificial Intelligence [19] have enabled a plethora of techniques in the field of Computer Vision [20], which has increased the number of methods used in emotion detection. The large number of authors and publications has popularized various Deep Learning techniques and methods in academia [21], [22], resulting in a proliferation of techniques used to solve problems in the computational environment.
Currently, there are numerous models employed in computer vision for facial expression recognition, such as [9], [11], and [23]. However, much of the production in this medium seeks both to evaluate the methods addressed, [9] and [24], and to apply them to particular areas of knowledge, [25], [26], and [27]. Although these approaches are more widespread and known in the scientific community, the literature lacks systematic studies that identify the main methods used in facial expression recognition. The analysis of these methods is extremely important to guide future work in this area, since the general analysis done by SLR favors numerous authors in their searches around a proposed topic [28].
That said, the goal of this systematic review is to map and analyze studies that address facial expression detection in order to categorize, summarize, and compare the models, techniques, and databases, as well as their performance, used in the context of facial expressions. The contributions of this research are evident in informing and motivating researchers in the field to apply their own facial expression detection models, and thus cooperate in the development of new research projects in this area.

III. METHODOLOGY
For the development of this research, an extensive search and screening of scientific papers was conducted through an SLR. This methodology seeks to evaluate and interpret relevant studies given a research question, and it is of higher quality than other types of literature review because it is comprehensive and unbiased [28]. A series of steps was followed: the importation of articles, the selection of articles, quality control, and data extraction. For this, the free platform Parsifal was used, which was created to carry out systematic reviews and helped in the search for research on the central theme of the article, defining the specific objectives through the five criteria known as PICOC (population, intervention, comparison, outcomes, and context), which facilitated the structuring of the research questions, search string, keywords, synonyms, and the inclusion and exclusion criteria [29].
The review was conducted following the steps suggested by the Parsifal tool, which in turn was developed from the studies by Kitchenham [30], who proposes that a systematic review should involve the following phases: Planning, Conduction, and Report of the Review. This methodology is represented in Figure 1 and discussed in Subsections III-A, III-B, and III-C.

A. PLANNING
The objective of this work is to identify studies that address techniques or algorithms used in emotion recognition based on Artificial Intelligence. At this stage, the basis of the SLR is formulated [30], since the formulation of detailed objectives and the elaboration of a research question to be answered are essential for any project. In addition, the main search and selection parameters that structure the study are defined.
1) PICOC
The PICOC methodology, as described in [31] (also found in [29]), presents five fundamental elements: population, intervention, comparison, outcomes, and context, which are used to describe all components related to the identified problem and to structure the research questions. Population, intervention, and comparison are not relevant to the research at hand, so they were not used. The outcomes were determined as artificial intelligence algorithms and techniques, and the context as emotion recognition, face detection, and affective computing.

2) RESEARCH QUESTIONS
Specifying research questions is one of the most important parts of any SLR, as the review questions guide the entire systematic review methodology. Two questions were defined:
a) What are the most used deep learning models for emotion recognition? The context of Facial Expression Recognition (FER) brings together numerous methods created to date. Deep learning models are highly versatile and are widely employed in FER [32]. With the remarkable success of deep learning, specifically of artificial neural networks (ANNs) and convolutional neural networks (CNNs), different architectures of this technique are explored to obtain better performance in facial expression recognition. For this reason, evaluating the deep learning models addressed in the selected studies was relevant to the SLR.
b) Which databases are used in face recognition research? Deep learning models are created to train computers to perform tasks like humans, such as voice recognition, trait identification, and even emotion and image recognition [33], which requires a large amount of data to train these algorithms. As a CNN is developed, the datasets chosen for training must be in accordance with its purpose. Since these data directly influence the results and performance of the algorithms, it became necessary to evaluate which databases were applied in each article evaluated in the SLR.

3) KEYWORDS AND SYNONYMS
Following the PICOC protocol used in the Parsifal platform, the definition of keywords is necessary in order to create the search string and import the articles. The keywords were chosen by consensus among the authors, selected to encompass all words directly related to the study of facial expression detection methods. Since the search covers articles published in both English and Portuguese, the keywords must also be in both languages. Table 2 presents the keywords used, as well as their synonyms for searching.
Based on the keywords and synonyms, the search string for this study was configured according to Table 3: OR and AND are Boolean operators, double quotes delimit compound terms, and parentheses logically group the keywords and their synonyms.
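As an illustration of this format only (the actual keywords and synonyms are those listed in Tables 2 and 3; the terms below are hypothetical stand-ins), a string of this shape can be assembled as follows:

```python
# Illustrative sketch: build a Boolean search string in the format of Table 3.
# The keyword/synonym lists here are hypothetical; the real ones are in Table 2.
keywords = {
    "facial expression": ["facial expression recognition", "expressao facial"],
    "emotion recognition": ["emotion detection", "affective computing"],
    "artificial intelligence": ["deep learning", "neural network"],
}

groups = []
for term, synonyms in keywords.items():
    # Compound terms are double-quoted; a keyword is OR-ed with its synonyms.
    quoted = [f'"{t}"' if " " in t else t for t in [term] + synonyms]
    groups.append("(" + " OR ".join(quoted) + ")")

# The keyword groups are AND-ed together to form the final search string.
search_string = " AND ".join(groups)
print(search_string)
```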

4) SOURCES
Papers published in national and international events and journals in the field of informatics were searched, as these are the main instruments for the dissemination of works of this nature: ACM Digital Library, Google Scholar, IEEE Digital Library, SciELO Citation Index, and Scopus. These databases were chosen according to the selection made in [34], which highlights those that have been consolidated in terms of SLR development. Table 1 presents the coverage of each of the databases used to search for the articles. According to the table, both the IEEE Digital Library and the ACM Digital Library stand out for their recognition in the dissemination of articles on Advanced Technology and Computing topics. Thus, the search in these libraries is relevant to this research, as it covers areas of interest in the context of models and techniques for facial expression detection. On the other hand, Google Scholar was selected for its comprehensiveness in terms of databases and knowledge areas, as was Scopus, which covers several thematic axes. This generalization is important due to the many applications that facial expression detection models have throughout the scientific world [18], [35], [36], and [37]. Finally, SciELO stood out for its distinctive preference for works produced in Latin America.

5) INCLUSION AND EXCLUSION CRITERIA
In [30], the importance of planning a second selection stage, with more detailed inclusion and exclusion criteria for article eligibility, is highlighted. For the first phase, these criteria were defined as follows:
Inclusion Criteria (IC): a) Studies addressing the use of Artificial Intelligence techniques for emotion recognition.
Exclusion Criteria (EC): a) The study is duplicated; b) The study does not address Artificial Intelligence techniques for emotion recognition through face detection; c) The study has incomplete information.

B. CONDUCTION
In this phase, the process of importing and selecting articles from the databases takes place. The objective is to find all studies that answer the research questions, obtained through the search strings [28]. Throughout the search and import of articles in the selected databases (see Table 1), only articles published between January 2018 and November 2022 were considered, as planned by the team. The search for papers, based on the search string in the databases specified during planning, resulted in the identification of 350 articles, as described in Section IV. From these 350 articles, a reading of their abstracts was performed to select the most relevant papers for the review, taking into account the inclusion and exclusion criteria defined in the planning stage.
Thus, after the importation of the 350 articles and the elimination of articles through the inclusion and exclusion criteria, a new selection was performed over the remaining articles, this time evaluating the quality of each work based on the following questions:
1) Is the research objective clearly described?
2) Has the study performed a well-described experiment to evaluate the proposal?
3) Does the study identify an experiment to evaluate the technique used?
4) Does the study show the effectiveness or efficiency of the technique used?
5) Is the database used informed?
6) Does the research show a described applicability?
7) Is the database reported in the public domain?
These questions comprise a quality assessment based on a full reading of each article; [38] describes a checkbox-based methodology with a questionnaire to assess the rigor and relevance of the studies, assigning a quality score to them.
The answers ''Yes'', ''Partially'', and ''No'' have the respective weights 1.0, 0.5, and 0.0. Thus, the maximum score for the evaluation of each article is 7.0 points. For a paper to pass to the qualitative review, it must have a minimum score of 5.0 points.
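As a minimal sketch of this scoring scheme (the answers below are hypothetical and serve only to show how the threshold is applied), the calculation can be expressed as:

```python
# Weights assigned to each possible answer of the seven quality questions.
WEIGHTS = {"Yes": 1.0, "Partially": 0.5, "No": 0.0}
THRESHOLD = 5.0   # minimum score required to proceed to the qualitative review

def quality_score(answers):
    """answers: one entry per quality question, e.g. ["Yes", "Partially", ...]."""
    return sum(WEIGHTS[a] for a in answers)

# Hypothetical evaluation of one article against the seven questions.
answers = ["Yes", "Yes", "Partially", "Yes", "Yes", "No", "Yes"]
score = quality_score(answers)                    # 5.5 out of 7.0
print(score, "accepted" if score >= THRESHOLD else "rejected")
```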

C. REPORT OF THE REVIEW
In this phase, the analysis and discussion of the results relevant to the study were carried out. The works selected in the last stage were analyzed with the aim of extracting and mapping the data corresponding to the research questions. The details of this methodological step are described in Sections IV and V. Figure 2 illustrates the entire systematic review process performed within the Parsifal platform, from the import of articles from the databases to the quality assessment used for the analysis of the results described in Section V.

IV. RESULTS
First, the search string was formulated, as presented in Table 3 and explained in Section III-A3. This string was created based on the keywords, synonyms, and specifications of this search, exemplified in Table 2 and Table 3. It was important both to guide the search within the databases and to better identify the studies relevant to this systematic review. The search for articles was performed in the databases described in Subsection III-A4: ACM Digital Library, Google Scholar, IEEE Digital Library, SciELO Citation Index, and Scopus.
The search string, structured as described in Table 3, was applied directly to the IEEE Digital Library, Scopus, Google Scholar, and ACM Digital Library, returning significant results for the study of AI models applied to facial expression detection. However, the search in the SciELO Citation Index was done differently, because the string format proposed in Table 3 is not compatible with that database's search engine. As a result, the search was done in parts, one query per keyword (see Section III-A3), which led to few articles being selected.
After searching the databases, the articles were imported. This was done by exporting references in BibTeX format from each database platform. This format was used because it is the one recommended by the Parsifal platform, as it contains more complete information about the works. The BibTeX references retrieved in the search were then loaded into Parsifal.
At the end, all the selected articles were stored in the platform to be analyzed in the next steps (see Table 2). Regarding the number of articles selected per library, the IEEE Digital Library returned the largest number of articles, with 136, while Scopus and Google Scholar returned 95 and 9 articles, respectively. The ACM Digital Library and SciELO Citation Index returned 85 and 25 articles, respectively, totaling 350 articles. Figure 3 represents the distribution of the selected articles by data source.

After the selection of the 350 articles, the next step extracted the articles relevant to the review. To do this, four criteria were applied, for inclusion (IC) and exclusion (EC) of articles, as highlighted in Section III-A5. This step was necessary to select the works within the research interest, excluding the articles that did not meet the pre-selected criteria. During the application of the IC and EC, more than half of the articles were discarded (about 65%), most of them for presenting incomplete information or for not addressing Artificial Intelligence techniques for emotion recognition through face detection. In addition, duplicate studies were automatically removed by the Parsifal platform.
This selection step resulted in 227 discarded articles, leaving 123 articles: 84 from the IEEE Digital Library, 16 from the ACM Digital Library, 8 from Scopus, 8 from the SciELO Citation Index, and 7 from Google Scholar. Figure 4 graphically represents the relationship between the number of articles imported (blue) and accepted (red) after applying the Inclusion and Exclusion Criteria in each of the databases used.
In the last step, the quality assessment was performed, aiming to score the 123 articles obtained in the previous step and thus qualify them. This step required the most detailed and accurate analysis of the articles, as it was crucial for the final results of the systematic review. Its execution determined the best research according to the quality questions (presented in Section III-B) and, consequently, enabled the choice of the best AI techniques addressed in the total sample of selected articles.
As explained in Section III-B, 7 quality assessment questions were defined for the review. All served as the basis for scoring the 123 articles using three possible answers: Yes, Partially, and No, with respective weights of 1.0, 0.5, and 0.0. From these criteria and the number of questions, the maximum possible score was 7.0, while the minimum score required for selection was 5.0.
After the quality evaluation of the 123 articles, about 76.5% of the evaluated papers scored above 5.5, and 23.5% scored 5.0 or below. From these results, in order to complete the evaluative method and analyze the best AI techniques and algorithms, the best articles were selected, i.e., those with scores equal to 7.0. Then, 11 articles with top scores were obtained, corresponding to 17.3% of the total of 123 articles selected through the Inclusion and Exclusion Criteria. The next section analyzes the results obtained with the 11 articles selected after the entire systematic review process. As explained in Section III-B, which describes the evaluative method, this analysis is performed on top of the research questions presented in Section III-A2.

V. DISCUSSION
In the selected articles, facial image recognition was one of the most relevant factors for the identification of facial expressions, with images recognized from a database. Among the databases used, CK+, JAFFE, and FER-2013 stand out from the others, yielding better performance in CNNs because their images are treated in the laboratory; without image noise, shadows, and external interference, the image recognition algorithms achieve better accuracy, as reinforced in [39].

A. WHAT ARE THE MOST USED DEEP LEARNING MODELS FOR EMOTION RECOGNITION?
Table 4 presents the models employed in the selected articles. The prominence of methods using Convolutional Neural Networks (CNNs), known for their strong performance in object detection tasks in Computer Vision, is notable. The traditional architecture of a CNN is composed of three layers, as used in [48]: the first two correspond to the process of extracting features from the image, and the third, a Fully Connected (FC) layer, maps the extracted features to output data for the classification task [49]. This architecture gives CNNs an advantage over traditional neural networks, as the convolutional layers reduce image dimensions and therefore processing time, without losing the features needed for good prediction.

It was observed that all the selected articles use some CNN architecture to develop a face recognition model. This is because CNNs have demonstrated good results for image segmentation and face analysis by regions, and are considered a good strategy to identify different facial expressions, as described in [44], [47], and [50]. The most frequently recurring CNN model is VGG-16, present in articles [45] and [50]. VGG-16 was originally developed for large-scale classification of natural images, having 13 convolutional layers and 3 fully connected (FC) layers. A pre-trained VGG-16 is very powerful for various applications in image segmentation and is commonly trained with the CK+ database.
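To make this reuse of VGG-16 concrete, the sketch below adapts a pre-trained VGG-16 by freezing its 13 convolutional layers and replacing the last fully connected layer with a 7-class expression head; this is a generic transfer-learning sketch under our own assumptions (class count, input size, frozen layers), not the exact configuration of [45] or [50].

```python
import torch
import torch.nn as nn
from torchvision import models

# Load VGG-16 pre-trained on ImageNet (torchvision >= 0.13 weights API).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Freeze the 13 convolutional (feature-extraction) layers.
for p in vgg.features.parameters():
    p.requires_grad = False

# Replace the final FC layer so the network outputs 7 expression classes
# (e.g. the basic expressions labeled in CK+ or FER-2013).
vgg.classifier[6] = nn.Linear(4096, 7)

# Forward pass on a dummy 224x224 RGB face crop.
logits = vgg(torch.randn(1, 3, 224, 224))   # shape: (1, 7)
```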
According to the evaluation, CNNs such as VGG-16, VGG-2D, and ResNet50 are the most efficient for facial expression recognition. As shown in the selected articles, CNNs combined with optimization methods achieve higher accuracy than CNNs without them. Some of the most used techniques are prediction, face cropping, and landmark detection; these techniques seek to optimize the databases for better facial expression recognition.
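As an example of one of these preprocessing steps, face cropping is often done with an off-the-shelf detector before the crop is resized and fed to the CNN; the sketch below uses OpenCV's bundled Haar cascade, with the detector, parameters, and file names chosen by us for illustration.

```python
import cv2

# Off-the-shelf frontal-face detector shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(image_bgr, size=(224, 224)):
    """Return the largest detected face, cropped and resized, or None."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest box
    return cv2.resize(image_bgr[y:y + h, x:x + w], size)

# Usage (hypothetical file name): crop = crop_face(cv2.imread("subject.jpg"))
```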

B. WHICH DATABASES ARE USED IN FACE RECOGNITION RESEARCH?
All articles used some public database to train their CNN models. Figure 5 shows the distribution of articles per database used in this work; the colors represent each database and the sizes the quantities of articles. About 5 articles cite the use of the public database CK+, made available on the Kaggle platform and selected because it was created for face recognition applications. CK+ has images, at resolutions of 640 × 490 or 640 × 480 pixels, of people aged 18 to 50 performing 23 facial displays, and data augmentation is considered to improve accuracy when training an ML model. The division of the database into training and testing sets is left to the user.

The second most used database (23%) was RAF (Real-world Affective Faces), which stands out in facial expression recognition tasks, since the images in this dataset have a large variance of characteristics, such as age, sex, ethnicity, lighting conditions, and others. This variety of characteristics improves the learning of models, making it possible to distinguish expressions in different environments and people. The dataset has 29,672 real-world images and is divided into a training set and a testing set, where the training set is five times larger than the testing set.
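Since the CK+ description above mentions data augmentation as a way to improve accuracy on a relatively small, lab-controlled set, a typical augmentation pipeline is sketched below; the specific transforms and parameter values are our illustrative choices, not ones prescribed by the reviewed papers.

```python
from torchvision import transforms

# Minimal augmentation pipeline for training an expression classifier.
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),   # mirrored faces keep the same label
    transforms.RandomRotation(10),            # small head-pose perturbations
    transforms.ColorJitter(brightness=0.2),   # mild lighting variation
    transforms.ToTensor(),
])
```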
Another database, found and used by about 15% of the articles, is FER-2013. Other works chose to develop their own databases, such as LFFW (Light Field Faces in the Wild) and LFFC (Light Field Faces Constrained), used by 1 article (7%) and composed entirely of images without laboratory treatment, with the objective of training models that recognize facial expressions from several different angles and with elements such as shadows and light. In addition to the databases already mentioned, the use of bases such as AVEC 2013 and 2014 (7%), AffectNet (15%), FERG and MUG (7%), RAF (23%), KDEF (7%), MMI (15%), and JAFFE (7%) was observed. It was not possible to access the datasets in question because they are private databases.

C. SUMMARY AND DISCUSSION OF THE SELECTED ARTICLES
Table 5 summarizes the references, techniques, and performance of the deep learning models selected in the systematic review.
The selected articles show their performances and their databases. One can highlight the excellent performances that [44] and [47] obtained in their face and expression recognition.
The article [44] stands out among the others because it executes on more than one cluster, a technique called KMP (kernel multiview projection), presented in [51], which incorporates different feature representations and is able to exploit the complementary properties of different views while including multiple metrics across multiple kernels. Furthermore, the research reports facial expression recognition rates for methods such as MDA, SAML, HERML, and MSML, applied after the CNN execution in KMP. The MSML method stands out, with 99.29% ARR (Average Recognition Rate) when combined with controlled databases such as CK+.
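ARR is usually computed as the mean of the per-class recognition rates, i.e., the diagonal of the confusion matrix divided by each row total; assuming that definition, a short sketch with a toy three-class confusion matrix is:

```python
import numpy as np

def average_recognition_rate(conf_matrix):
    """ARR: mean of per-class recognition rates (diagonal / row sums)."""
    cm = np.asarray(conf_matrix, dtype=float)
    per_class = np.diag(cm) / cm.sum(axis=1)
    return per_class.mean()

# Toy confusion matrix: rows = true class, columns = predicted class.
cm = [[48, 1, 1],
      [2, 45, 3],
      [0, 2, 48]]
print(round(average_recognition_rate(cm), 4))   # 0.94
```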
Reference [47] achieves excellent performance in facial expression recognition, which is due to the use of merged CNNs, a GAN (Generative Adversarial Network) and a VAE (Variational Auto-Encoder), giving rise to PPRL-VGAN, which has a high impact on the final result. Alongside the CNNs, the databases have a large impact on facial expression recognition results.
In most of the other articles, with the exception of those mentioned above, conventional CNNs such as VGG and ResNet are used together with well-known methods such as Adagrad and Adam (well-known parameter optimization algorithms), used mainly in article [47] to improve the quality of the resulting model relative to the default hyperparameter configuration. It is also worth mentioning that models trained on datasets with laboratory-controlled images did better in facial expression recognition tests on ''in the wild'' images, as in articles [42], [44], and [45].
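In modern frameworks, the optimizers named above are swapped in with a single line; the fragment below shows a generic training step with Adam (Adagrad as the alternative), where the model, learning rates, and data are placeholders of our own choosing.

```python
import torch

# Stand-in for the final classification head of a CNN (7 expression classes).
model = torch.nn.Linear(512, 7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)   # alternative
criterion = torch.nn.CrossEntropyLoss()

# One training step on a dummy batch of extracted features and labels.
features = torch.randn(8, 512)
labels = torch.randint(0, 7, (8,))
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
```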
CNNs such as ResNet50 and VGG-16 perform very well in the selected articles, and when combined with datasets such as CK+ and FER, metrics such as accuracy and average recognition rate increase drastically. Optimization techniques are also important in defining the outcome of a CNN for its specific purpose. Thus, it is clear that good CNN performance requires optimization methods and databases with images treated in the laboratory.
Thus, after the analysis of the selected articles, it became clear that CNNs combined with optimization methods and trained with lab-controlled image datasets perform better when tested on ''in the wild'' data, but do not achieve the same performance when first trained on ''in the wild'' data and later tested on lab-controlled image datasets.
It is emphasized that face recognition models can be used in a variety of applications, as could be observed in this systematic review. Among the articles analyzed, the areas with applications are Visual Computing (70%), Health (15%), Information Security (7%), and Human-Robot Interaction (7%).

VI. CONCLUSION
This article presented a mapping and summary of studies that address the techniques and algorithms most used in facial expression recognition based on artificial intelligence. With this, it was possible to identify that there are numerous CNN models and databases, and that those trained on laboratory-controlled databases, without external interference, stand out, according to [41], [44], and [47].
In general, the analyzed studies covered several aspects of CNNs, from the most common errors when recognizing expressions to the difficulties found in recognizing faces in different environments, as described in [46]. Furthermore, the extensive use of CNNs such as VGG and ResNet50 was highlighted; combined with databases such as CK+, these networks obtained high performance in their tests, as in [52].
However, VGG may be preferred in some situations because it has a simpler, more uniform network structure that is easier to understand and implement, which can also make it easier to adapt and train with a smaller data set.
On the other hand, ResNet50 is a deeper and more complex architecture, which is able to achieve better results in some computer vision tasks, such as image classification on large datasets, due to its ability to learn more complex and abstract features. In addition, ResNet50 is less prone to overfitting, which is an advantage on large and complex datasets.
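For a concrete sense of the size difference between these two backbones, their standard torchvision definitions can be compared directly; VGG-16 is shallower but considerably heavier in parameters (most of them in its fully connected layers), while ResNet-50 is deeper yet much lighter.

```python
from torchvision import models

# Compare parameter counts of the standard architecture definitions
# (weights=None: no pre-trained weights are downloaded; torchvision >= 0.13).
for name, ctor in [("VGG-16", models.vgg16), ("ResNet-50", models.resnet50)]:
    m = ctor(weights=None)
    n_params = sum(p.numel() for p in m.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
# VGG-16 has roughly 138M parameters, ResNet-50 roughly 25.6M.
```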
Other important algorithms that should be mentioned are YOLOv3 and VGG-16. VGG-16 is a convolutional neural network used primarily in image classification tasks, while YOLOv3 is an object detection neural network used to locate and classify objects in an image.
VGG-16 is known for its effectiveness in image classification, and is capable of achieving high accuracy results on various benchmark datasets. In addition, VGG-16 is relatively simpler compared to other deep neural network architectures, making it easier to implement and train.
Although the evaluation method presented has selected few articles in relation to the initial base of 350 articles, the importance of the data obtained for future research is emphasized. In addition, it is suggested that new publications continue to contribute to the growing range of work on facial expression recognition, with new CNN models and databases, generating new performances and new ways of recognizing expressions and faces.
It is hoped, therefore, that this systematic review will inform and motivate professionals from various fields, specifically researchers and developers, to research and develop their own CNNs for face and expression recognition, allowing more data to be collected for research and helping future professionals in the area.
As with most research, this one also has some limitations. A first possible limitation was restricting the search period to between January 2018 and November 2022, which was necessary to enable the analysis of the selected articles to begin, an analysis that took several months. As a consequence, we inevitably missed other interesting articles that could have been evaluated in the period between the article analysis and the submission date.
In addition, the proposed method involved applying exclusion/inclusion criteria to find relevant articles that could be evaluated. This decision may have excluded short articles presenting original ideas or articles without sufficient evidence.
Another factor that limits research in the PICOC method is the lack of flexibility. The PICOC framework can be quite restrictive, which can limit the researcher's ability to make adjustments and modifications as the research progresses. For example, it may be necessary to adjust the PICOC framework to deal with a lack of relevant data or new findings that arise during the research.
Although not highlighted as a limitation, one difficulty worth noting is the depth of analysis the review demanded, which required considerable attention and time from the team; this effort was compounded by the many research projects the group members develop in their own areas of knowledge, since the team is multidisciplinary.
As future work, new databases will be added to the SLR, and new words may be incorporated into the search string. It is worth mentioning that this article is part of a research project that aims to develop a tool for emotion detection in scenarios of violence against women [53], [54].
LYANH VINICIOS LOPES PINTO received the degree from the Institute of Technology, Federal University of Pará (UFPA). He was a fellow in extension projects in the area of artificial intelligence and its subareas. He is a current fellow with the Pro-Rectory of Extension (PIBEX), in the area of Unity 2D game development. He is currently with the Operational Research Laboratory (LPO) developing research in the computational area applied to solving social problems. His research interests include applied computing, artificial intelligence, computational, and web and mobile development.