A SYSTEMATIC REVIEW OF METHODS OF EMOTION RECOGNITION BY FACIAL EXPRESSIONS


Facial recognition is integral and essential in today's society, and the recognition of emotions based on facial expressions is already becoming commonplace. This paper provides an analytical overview of video databases of facial expressions and of several approaches to recognizing emotions by facial expressions, covering the three main image analysis stages: pre-processing, feature extraction, and classification. The paper presents both approaches based on deep neural networks and traditional approaches to recognizing human emotions from visual facial features. The current results of some existing algorithms are presented. When reviewing the scientific and technical literature, the focus was mainly on sources containing theoretical and research information on the methods under consideration and on comparing traditional techniques with methods based on deep neural networks supported by experimental research. An analysis of the scientific and technical literature describing methods and algorithms for analyzing and recognizing facial expressions, together with worldwide research results, has shown that traditional methods of classifying facial expressions are inferior in speed and accuracy to artificial neural networks. This review's main contribution is a general understanding of modern approaches to facial expression recognition, which will allow new researchers to grasp the main components and trends in the field. A comparison of worldwide research results has shown that combining traditional approaches with approaches based on deep neural networks yields better classification accuracy; however, the best-performing classifiers are artificial neural networks.

Pre-processing allows one to cope with the problems described earlier [6,7]. At this stage, the face area is localized, the found area is cropped and scaled, and the image's contrast is adjusted.
Feature extraction is based on the geometry and appearance of the face. By geometry, we mean the shape and location of facial components such as the eyes, mouth, and nose; by appearance, we mean the texture of the skin.
The classification of features is aimed at developing an appropriate classification algorithm for identifying facial expressions. This analytical review's primary purpose is to compare methods for pre-processing facial images, extracting visual features, and machine classification of emotions. This will allow us to determine the further direction of research toward creating a new automatic system for recognizing human emotions by facial expressions. Figure 2 shows a general diagram of the image analysis method for identifying emotions based on a person's facial expression. The paper presents a brief overview of facial expression databases; the primary methods of pre-processing, feature extraction, and classification used for the FER task; the current results of some existing algorithms; and a comparison of FER methods.

Overview of facial expression databases
Databases of emotional facial expressions are divided into static images and dynamic sequences of frames. Static images capture only the peak intensity of the emotion being experienced, while dynamic sequences capture facial expressions as they change over time. To create FER systems, it is promising to use databases containing video sequences. Table 1 provides a summary of some existing databases.

Image pre-processing
Image pre-processing allows one to cope with problems such as a lack of data on facial expressions, intra-class differences and inter-class similarities, small changes in the face's appearance, changes in head position, and illumination, and thereby improve the accuracy of FER systems. Image pre-processing can include the following steps: localizing the face area, cropping and scaling the found area, aligning the face, and adjusting the contrast.
1. Localization of the face area determines the face's size and location in the image. The most commonly used localization methods are:
- Viola-Jones object detection (VJ) [10].
- Single shot multibox detector (SSD) [11].
- Histogram of oriented gradients (HOG) [12].
- Max-margin object detection (MMOD) [13].
2. Cropping and scaling of the found face area are carried out according to the coordinates produced by the localization methods. Since the located face areas have different sizes, it is necessary to scale the images, i.e., bring them all to the same resolution. For these tasks, the following are applicable:
- Bessel's correction [14].
- Gaussian distribution.
For example, the application of these methods is presented in [15,16].
3. Face alignment reduces intra-class differences. For example, for each facial expression, a reference image is selected and divided by color components or by the most informative areas of the face (for example, the forehead and eyes); the remaining images are then aligned relative to the reference images. The following methods are used for this task:
- Scale-invariant feature transform (SIFT) [17].
- Region of interest (ROI) [18].
Examples of the use of these methods are presented in [19,20].
4. Contrast adjustment allows one to smooth images, reduce noise, increase the contrast of the face image, and improve saturation, which helps to cope, for example, with the problem of illumination. Methods of contrast adjustment include:
- Histogram equalization [21].
- Linear contrast stretching [22].
Examples of the use of these methods are presented in [23].
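As an illustration of the contrast-adjustment step described above, histogram equalization can be sketched in a few lines of NumPy. This is a minimal sketch of the standard algorithm for 8-bit grayscale images, not code from any of the cited works:

```python
import numpy as np

def equalize_histogram(img: np.ndarray) -> np.ndarray:
    """Histogram equalization for an 8-bit grayscale face image.

    Spreads pixel intensities over the full 0-255 range, which
    reduces the dependence of FER features on scene illumination.
    """
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first non-zero value of the cumulative histogram
    # Map each intensity through the normalized cumulative distribution.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]

# Example: a low-contrast image occupying only intensities 100..150.
rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 151, size=(48, 48), dtype=np.uint8)
equalized = equalize_histogram(low_contrast)
print(equalized.min(), equalized.max())  # → 0 255
```

After equalization the intensities span the full dynamic range, which is exactly the illumination-normalizing effect the pre-processing stage relies on.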
Various image manipulations, such as rotation or displacement, increase the diversity of images and expand databases. Extending the database with variable images is helpful for deep-learning-based methods. For traditional methods, it is more beneficial to use models based on face alignment, which, on the contrary, reduces the variability associated with changes in head position, minimizing intra-class differences and increasing inter-class similarities. In this case, each class has its own reference image, according to which that class's images are aligned. Choosing a suitable pre-processing method takes a long time, as facial recognition speed and accuracy depend on it.
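The database-expanding manipulations mentioned above (mirroring, shifts, rotations) can be sketched as follows; the particular set of variants is an illustrative choice, not a prescription from the reviewed papers:

```python
import numpy as np

def augment(face: np.ndarray) -> list:
    """Generate simple variants of a face image: the original, a mirror
    image, small horizontal shifts, and 90-degree rotations. Such
    manipulations expand a training database for deep-learning FER."""
    variants = [face, np.fliplr(face)]                       # original + mirror
    variants += [np.roll(face, s, axis=1) for s in (-2, 2)]  # horizontal shifts
    variants += [np.rot90(face, k) for k in (1, 3)]          # rotations
    return variants

face = np.arange(48 * 48, dtype=np.uint8).reshape(48, 48)
print(len(augment(face)))  # → 6
```

In practice, shifts by wrap-around (`np.roll`) would be replaced with padded translations, but the sketch shows how a single labeled image yields several training samples.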

Extracting visual features
The next stage of FER is feature extraction. At this stage, the elements that are most informative for further processing are found. Depending on the functions performed, methods for extracting informative visual features are divided into several main types, each described in detail below.
1. Methods based on geometric objects extract information about geometric objects, such as the mouth, nose, eyebrows, and others, and determine their location. With the geometric objects and their locations known, the distances between the objects can be calculated; the obtained distances and object coordinates serve as features for the further classification of emotions. Methods based on geometric objects include:
- The Line Edge Map (LEM) descriptor [24] measures the similarity of facial expressions based on object boundaries. An example of using the descriptor is presented in [25].
- The Active Shape Model (ASM) [26] detects object edges using facial landmarks, represented as a chain of feature points.
- The Active Appearance Model (AAM) [27] is an extended version of ASM that also forms the textural features of facial expressions. Examples of using ASM and AAM are presented in [28].
- HOG compares facial expressions by the direction of gradients. An example of using HOG is shown in [6].
- The Curvelet Transform [29] conveys information about an object's location and spatial frequency. An example of using the transform is shown in [30].
2. Methods based on appearance models extract information about the textural features of the face. For example, a different number of wrinkles in the eye area indicates different facial expressions, so methods based on appearance models are among the most informative for classifying emotions.
Methods based on appearance models include:
- The Gabor filter [31] is a classical method of extracting facial expression features, allowing a different deformation model to be selected for each emotion.
- Local Binary Patterns (LBP) [32] represent image pixel areas in a binary code containing the distinctive features of both local and global (texture) areas of facial expressions.
- Local Phase Quantization (LPQ) is resistant to image blur. The method is based on the short-term Fourier transform, which identifies periodic components in images of facial expressions and assesses their contribution to the formation of the initial data. Examples of using the LBP and LPQ methods are presented in [33].
- The Weber Local Descriptor [34] extracts features in two stages. The first stage divides the image into local areas (such as the mouth and nose) and normalizes the images. The second stage extracts distinctive textural features using gradient orientation to describe facial expressions. An example of using this method is provided in [35].
- The Discrete Wavelet Transform (DWT) [36] extracts texture features by splitting the original image into low- and high-frequency bands. For example, the authors of the article [37] use this descriptor.
3. Methods based on global and local objects include:
- Principal Component Analysis (PCA) [38] extracts the distinctive features of facial expressions from the covariance matrix, reducing the dimension of the feature vectors.
- Linear Discriminant Analysis (LDA) [38] searches for the directions with the most significant differences between classes and groups the features of the same class. Examples of using PCA and LDA are presented in [1].
- Optical flow (OF) [39] assigns a velocity vector to each pixel, so information about facial muscle movement is extracted from a sequence of images, which makes it possible to consider the deformation of the face in dynamics. For example, the authors of [40] use this method.
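The basic 8-neighbour LBP descriptor mentioned above can be sketched as follows. This is a minimal NumPy version of the classical operator, without the uniform-pattern, multi-scale, or block-wise extensions used in the cited works:

```python
import numpy as np

def lbp_histogram(img: np.ndarray) -> np.ndarray:
    """Basic 8-neighbour Local Binary Patterns.

    Each interior pixel is compared with its 8 neighbours; the
    comparison bits form a code in 0..255, and the normalized
    histogram of codes is the texture descriptor of the face region.
    """
    center = img[1:-1, 1:-1].astype(np.int16)
    # Neighbour offsets, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy: img.shape[0] - 1 + dy,
                        1 + dx: img.shape[1] - 1 + dx].astype(np.int16)
        codes |= (neighbour >= center).astype(np.uint8) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()

patch = np.random.default_rng(1).integers(0, 256, (48, 48), dtype=np.uint8)
features = lbp_histogram(patch)
print(features.shape, round(features.sum(), 6))  # → (256,) 1.0
```

In a full FER pipeline, the face image is usually split into a grid of regions and the per-region histograms are concatenated, so that the descriptor keeps spatial information as well as texture.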
Of the presented feature extraction methods, appearance-based methods are the most distinctive, since they extract textural features of appearance, which are essential parameters for FER, but they are less robust to occlusions. Methods based on geometric objects are better adapted to occlusion, and the distances between facial landmarks are more characteristic of inter-class differences. The use of hybrid methods to extract facial expression features increases classification accuracy.

Classification of features
Traditional classification methods include:
- The Hausdorff distance [41] measures the distance between areas of interest and compares the resulting distances between classes. An example of using the method is presented in [7].
- The minimum distance classifier (MDC) minimizes the distance between classes: the smaller the distance, the greater the similarity between classes. An example of using the method is presented in [2].
- The k-nearest neighbors algorithm (KNN) [42] classifies facial emotions by the first k elements whose distance is minimal. An example of using the method is presented in [3].
- LDA presents matrix data in the form of a scatter plot, which allows the features of classes to be separated visually. For example, the authors of [43] use the LDA method.
- The support vector machine (SVM) [44] builds a hyperplane separating the sample objects. The greater the distance between the separating hyperplane and the objects of the separated classes, the lower the classifier's average error. For example, the authors of [1,3,4] use this method.
Methods based on deep neural networks include:
- Convolutional neural networks (CNN) extract local features from images; examples of their use are presented in [5,6].
- Recurrent neural networks (RNN) use contextual information about previous images: one training set contains a sequence of images, and the classification of the entire set corresponds to the classification of the last image.
- The combination of convolutional and recurrent neural networks can extract local data and use temporal information for more accurate image classification.
For example, the authors in [49] use a combination of convolutional and recurrent neural networks.
Other DNNs are modifications of the neural networks presented above. The following metrics are used to analyze the effectiveness of classification methods: recall, precision, and F-measure. With a good set of training data, the best classification method is deep neural networks: they automatically learn and extract features from the input images and detect emotions on the face with higher accuracy and speed than other methods.
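The simplest of the traditional distance-based classifiers above, the minimum distance classifier, can be sketched as a nearest-class-mean rule. The toy 2-D feature vectors and class labels below are made up for illustration:

```python
import numpy as np

def fit_class_means(X: np.ndarray, y: np.ndarray) -> dict:
    """Store one mean feature vector per emotion class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_mdc(means: dict, x: np.ndarray) -> int:
    """Minimum distance classifier: assign x to the class whose
    mean feature vector is closest in Euclidean distance."""
    return min(means, key=lambda label: np.linalg.norm(x - means[label]))

# Toy 2-D feature vectors (e.g., two normalized landmark distances)
# for two emotion classes: 0 = "neutral", 1 = "happy".
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])
means = fit_class_means(X, y)
print(predict_mdc(means, np.array([0.85, 0.85])))  # → 1
```

KNN follows the same pattern but keeps all training vectors and votes over the k nearest ones instead of comparing against class means.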

Comparison of facial expression recognition algorithms
Since the accuracy of FER algorithms depends on the database used for training and testing, it is only correct to compare different methods on the same data sets. To compare the methods on static images, results obtained on the FER2013 database were studied; the SAVEE database was used to evaluate recognition algorithms for dynamic changes in facial expressions. The databases selected for comparison are widely used by researchers for this task.

Comparison of results on the FER2013 dataset
The authors in [6] compared FER algorithms using the HOG and CNN feature extraction methods. They found that with a small 48 × 48 image resolution and low image quality, the results of the HOG method are inferior to those of CNN. The authors of [5] experimented with CNN parameters. As a pre-processing method, they used the Viola-Jones method; as a result, the localized face areas had a resolution of 48 × 48 pixels. Figure 4 shows the CNN architecture proposed by the authors in [5].
The authors of the article [49] proposed using a multilayer maxout activation function (MAP), which copes with the exploding-gradient problem in deep learning. To extract visual features, the authors used a CNN and an RNN with long short-term memory (LSTM); the combination of models also provides information about the dynamic characteristics of facial expressions. The authors used SVM to classify emotions. All images were randomly cropped to 24 × 24 pixels.
The authors in [50] used the JAFFE data set as training data for k-means. This was necessary to obtain a set of cluster centers; the k-means method is a clustering method used to divide n observations into k clusters. The obtained cluster centers were then fed in as the initial values of the convolution kernels in the CNN. Classification in the proposed algorithm was performed by SVM. For pre-processing, the authors used Viola-Jones, so the resulting localized face areas had a resolution of 48 × 48 pixels. Table 2 summarizes the FER methods used and the classification accuracy on the FER2013 dataset, as presented in [5,6,49,50].
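The k-means clustering step described above can be sketched as follows. This is a generic NumPy implementation with a simple deterministic initialization, not the authors' code from [50]:

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 50) -> np.ndarray:
    """Plain k-means: partition n observations into k clusters and
    return the k cluster centers (in [50], such centers initialize
    the CNN convolution kernels)."""
    # Deterministic initialization for the sketch: evenly spaced rows.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Assign each observation to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned observations.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

# Two well-separated toy clusters of "patch" vectors.
X = np.vstack([np.zeros((10, 2)), np.ones((10, 2)) * 5])
centers = kmeans(X, k=2)
print(sorted(centers.sum(axis=1).round(1)))  # → [0.0, 10.0]
```

For kernel initialization, each observation would be a flattened image patch and each recovered center would be reshaped back into a convolution kernel.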
According to the results (Table 2), the authors used the Viola-Jones pre-processing method. Methods based on geometric objects (ASM, HOG, SIFT) and on global and local objects (PCA) were used to extract features; in some of the papers presented, CNN and the CNN + LSTM combination were the preferred feature extraction methods. The authors used SVM, KNN, and CNN as classifiers. The highest result was shown by the authors of [5], who used the Viola-Jones method and CNN while experimenting with CNN parameters; the accuracy was 89.78 %. From this, we can conclude that CNN copes with classification more accurately than traditional classification methods such as SVM and KNN.
Figure 5: Neural network architecture in the emotion recognition system [5].
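The convolution operation at the core of the CNN-based methods compared above can be illustrated as follows. The edge-detecting kernel is a generic example, not a kernel from the architecture in [5], and, like most deep-learning frameworks, the sketch actually computes cross-correlation:

```python
import numpy as np

def conv2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid 2-D cross-correlation (no padding, stride 1), the basic
    operation a CNN layer applies to a 48 x 48 face image."""
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel applied to a 48 x 48 image with a dark
# left half and a bright right half: the response peaks at the edge.
img = np.zeros((48, 48))
img[:, 24:] = 1.0
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
response = conv2d(img, sobel_x)
print(response.shape, response.max())  # → (46, 46) 4.0
```

A trained CNN learns many such kernels per layer instead of using hand-designed ones, which is precisely what distinguishes it from the HOG-style extractors it outperforms in Table 2.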

Comparison of results on the SAVEE dataset
As mentioned earlier, the FER2013 database consists of static images; since facial expressions change dynamically, it is also necessary to compare results obtained on a database containing video recordings. The SAVEE database will be used to compare the FER algorithms.
The authors in [51] use a new hybrid structure called tandem modeling (TM), which consists of two hierarchically connected feed-forward neural networks. The first network is a not-fully-connected neural network (NFCN); the second is a standard fully connected network (FCN). Both networks use a hidden bottleneck connection layer that allows all outputs to be combined. For pre-processing, the authors used Viola-Jones localization, as a result of which the localized face areas had a resolution of 128 × 128 pixels, and local binary patterns on three orthogonal planes (LBP-TOP). The LBP-TOP matrices of each class were independently passed through a 5-layer MLP and fed to the neural network. Table 3 summarizes the FER methods used and the classification accuracy on the SAVEE dataset, as presented by the authors in [33,40,51].
The authors of [33] used the following approach. Coarse-to-fine auto-encoder networks (CFAN) were responsible for pre-processing, allowing facial expression images to be gradually optimized and aligned; as a result, the localized face areas had a resolution of 64 × 64 pixels. Two feature sets were responsible for extracting global features: CNN and Gist, which uses Gabor filters. Local features were extracted by LBP and LPQ-TOP. The authors used discriminative multiple canonical correlation analysis (DMCA) to combine the local and global features, and the kernel entropy component analysis (KECA) algorithm to reduce the feature dimension. An SVM classifier completes the algorithm.
The authors of [40] used optical flow (OF) and accumulative optical flow (AOF) to extract features. A 3D CNN was chosen because it allows features to be extracted from both the spatial and temporal dimensions: in a 3D CNN, the kernel moves in three directions, and the input and output data are 4-dimensional. To select a suitable set of hyperparameters, the authors chose a Bayesian optimizer. The Viola-Jones method and histogram equalization (HE) were applied to pre-process the images, so the localized face areas had a resolution of 64 × 64 pixels.
In total, based on the results obtained on the SAVEE database (Table 3), the following can be concluded. At the pre-processing stage, the Viola-Jones method, histogram equalization, and CFAN were used. At the feature extraction stage, LBP, LPQ, OF, CNN, and Gist were used. To classify facial expressions, SVM, MLP, TM, and 3D CNN were selected. The highest accuracy was obtained by the authors of [51]: a recognition accuracy of 100 %. They used the Viola-Jones method for image pre-processing, local binary patterns on three orthogonal planes at the feature extraction stage, and then tandem modeling. The accuracy of FER is high and is mainly achieved by artificial neural networks. However, the authors of the articles did not test their trained models on other datasets, which indicates that the trained models are not universal.

Conclusion
Automatic facial expression recognition is an essential component of multimodal human-computer interaction interfaces and computer paralinguistics systems. For accurate recognition of emotions from facial expressions, it is necessary to choose suitable methods for pre-processing the image, extracting the visual features of facial expressions, and classifying emotions. Today, traditional classification methods are inferior in speed and accuracy to artificial neural networks. However, despite the large number of experiments carried out, the facial expression recognition accuracy of existing algorithms is still not high enough across various input parameters, so the task of creating a universal algorithm remains relevant.
For further research, the authors plan to combine the SAVEE, CREMA-D, and RAVDESS databases, which will address the lack of data for training models and allow the classifier to be trained to recognize emotions automatically regardless of ethnicity, age, or gender. To localize the face area, it is planned to use an active shape model, which delineates the edges of objects using facial landmarks; this method shows high speed and accuracy of face area detection. The images will be scaled to resolutions of 48 × 48, 64 × 64, 128 × 128, and 224 × 224 pixels, which will allow conclusions to be drawn about the effect of resolution on the accuracy of facial expression recognition. To adjust the contrast, the histogram equalization method will be used, which will reduce noise and the dependence of the facial expression recognition algorithm on illumination. The active shape model will also be used to extract features such as the distances between the facial landmarks and the center of mass, and the head rotation angle. It is planned to use CNN to classify emotions based on the obtained pixel arrays at different resolutions.
It is planned to use the FCN and SVM methods to classify emotions based on the extracted features (the coordinates of the facial landmarks, the distances to the center of mass, and the head rotation angle).