Swin-FER: Swin Transformer for Facial Expression Recognition

Abstract: The ability of transformers to capture global context information is highly beneficial for recognizing subtle differences in facial expressions. However, compared to convolutional neural networks, transformers require the computation of dependencies between each element and all other elements, leading to high computational complexity. Additionally, the large number of model parameters requires extensive training data to avoid overfitting. In this paper, according to the characteristics of facial expression recognition tasks, we made targeted improvements to the Swin transformer network. The proposed Swin-Fer network adopts a fusion strategy from the middle layers to the deeper layers and employs data dimension conversion so that the network perceives more spatial dimension information. Furthermore, we integrated a mean module, a split module, and a group convolution strategy to effectively control the number of parameters. On the Fer2013 dataset, an in-the-wild dataset, Swin-Fer achieved an accuracy of 71.11%. On the CK+ dataset, an in-the-lab dataset, the accuracy reached 100%.


Introduction
The transformer is a neural network model that uses self-attention mechanisms to establish dependency relationships within sequences, and it has attracted wide attention for its excellent performance in natural-language processing tasks [1]. Many researchers have tried to apply the transformer model to facial expression recognition tasks. The self-attention mechanism enables the model to consider all other elements in a sequence while processing each element, which is especially useful for capturing subtle variations in expressions [2]. Compared with convolutional networks, the transformer is not constrained by local receptive fields, allowing it to flexibly focus on any part of an image. Moreover, its self-attention structure supports better parallel processing [3,4]. However, this self-attention mechanism involves computing dependencies between each element and all other elements, resulting in high data dimensions and computational overheads [5]. Additionally, the model includes numerous parameters and usually needs a lot of data for effective training to mitigate overfitting. Consequently, the model is larger, requiring more computational resources for both training and inference [6].
The Swin transformer, a variant of the transformer, has been specifically optimized for visual tasks. Its window partitioning mechanism and hierarchical structure enable the model to learn features at different scales [7]. In this paper, based on the characteristics of facial expression recognition tasks, we made targeted modifications to the original Swin transformer network structure and propose the Swin-Fer network (Swin Transformer for Facial Expression Recognition), which achieved promising experimental results. The main contributions are as follows:
1. Swin-Fer adopts a fusion strategy from middle to deep layers to capture facial expression features more accurately. This guides the network to learn the relationships between local and global features effectively, thus enhancing its expression recognition ability.
2. Using the data dimension transformation strategy, the whole network model can perceive more spatial dimension information. In addition, to improve the generalization ability of the model, the mean module and the split module, as well as group convolution, are introduced. While achieving satisfactory experimental results, the parameter count is kept largely unchanged.
3. The proposed method achieves an accuracy of 71.11% on the Fer2013 dataset under natural conditions and 100% on the CK+ dataset in laboratory environments. The sensitivity and specificity of the model, as indicated by the area under the curve, are also good.

Related Works

Transformer for Facial Expression Recognition
As a model built on the self-attention mechanism, the transformer demonstrates excellent performance and adaptability in various facial expression recognition tasks [8]. Xue et al. [9] applied the transformer method to facial expression recognition tasks and developed the TransFER model by integrating strategies such as Multi-Attention Dropping (MAD) and Multi-Head Self-Attention Dropping. This model highlights important local blocks (patches) while suppressing irrelevant ones, thereby exploring rich relationships between local blocks and addressing to some extent the issue of "large inter-class similarity, small intra-class similarity" in facial expression recognition tasks. Ma et al. [2] proposed the VTFF (Visual Transformer with Feature Fusion) network, which introduced the Attentional Selective Fusion (ASF) method to leverage two feature maps generated by a dual-branch CNN. Through global-local attention mechanisms, multiple features are fused to capture discriminative visual words, and the method achieved excellent performance. In facial expression classification tasks, many researchers input image samples containing various emotional states into transformer models and use a softmax classifier to predict the emotional label of each image. Kim et al. [10] observed that the Vision Transformer (ViT) may have limited ability to capture subtle changes in facial expressions and may lose local image features. They proposed Squeeze-ViT, which reduces feature dimensions to lower computational complexity and integrates global and local features to enhance network performance. Zhao et al. [11] proposed the Former-DFER network structure for natural environmental scenes, composed of a CS transformer (a convolutional spatial transformer) and a T transformer (a temporal transformer). This architecture guides the network to learn spatial and contextual facial features, thus improving the accuracy of facial expression classification.

Overview of the Swin Transformer
The Swin transformer, as described by Liu et al. [7], is a hierarchical vision transformer designed to handle various computer vision tasks efficiently. It uses a shifted-windows approach for self-attention, which allows for better scalability and performance compared to existing transformer-based models. It builds hierarchical feature maps, similar to convolutional neural networks (CNNs), enabling it to capture multi-scale representations effectively.
A key innovation of the Swin transformer is its use of shifted windows for self-attention computation. Self-attention is first computed within local, non-overlapping windows, and then these window positions are shifted between successive layers. This approach facilitates cross-window connections, enhancing the model's ability to capture long-range dependencies while maintaining computational efficiency. The combination of non-overlapping and shifted windows allows the Swin transformer to balance local context extraction within windows with global context integration across windows, leading to high performance in vision tasks. By focusing on local windows, the computational complexity of the self-attention mechanism is significantly reduced, making the Swin transformer more efficient and scalable for high-resolution images. Liang et al. [12] address the challenges that occlusions and head-pose variations pose for facial expression recognition (FER) using a convolution-transformer dual branch network (CT-DBN). The CT-DBN leverages the strengths of both CNNs and the Swin transformer to capture local and global facial information, respectively. Qin et al. [13] integrated a Multi-Level Channel Attention (MLCA) module into each task-specific subnet, enabling adaptive feature selection from optimal levels and channels. This design allows the Swin transformer to perform facial expression recognition efficiently and accurately, achieving strong experimental results and demonstrating superior understanding of facial features.

Proposed Method
This paper designs a facial expression recognition method based on the Swin transformer, named Swin-Fer, whose network structure is depicted in Figure 1. Because deeper-level features contain richer semantic information, which suits the characteristics of facial expression recognition tasks, Swin-Fer employs a fusion strategy from middle to deep layers [14]. After the image is input into the network, window segmentation and patch embedding are carried out: the input image is divided into blocks of a fixed size, and each block is embedded to obtain a fixed-size vector representation. The channel number is converted to 96, which is equivalent to performing a non-overlapping 4 × 4 convolution (kernel size = 4, stride = 4). Hence, an input of size B × 3 × 224 × 224 becomes a B × 96 × 56 × 56 feature map, since 224/4 = 56. The data are then flattened from B × C × H × W (batch size × channel number × height × width) starting at the height dimension, which merges the two spatial dimensions into a single one of length 56 × 56 = 3136. Subsequently, a transpose operation exchanges the spatial and channel dimensions, giving a B × 3136 × 96 token sequence.
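The shape changes in the patch embedding step can be traced with a minimal NumPy sketch. It assumes the 4 × 4, stride-4 convolution is realized as a per-block linear projection; the function name `patch_embed` and the random stand-in weights are illustrative and not taken from the paper.

```python
import numpy as np

def patch_embed(x, weight, bias, patch=4):
    """Non-overlapping patch embedding: same result as a conv with kernel = stride = patch."""
    b, c, h, w = x.shape
    gh, gw = h // patch, w // patch
    # Cut the image into a gh x gw grid of patch x patch blocks: (B, C, gh, p, gw, p).
    blocks = x.reshape(b, c, gh, patch, gw, patch)
    # Project every block to embed_dim channels: (B, embed_dim, gh, gw).
    feat = np.einsum("bcipjq,ecpq->beij", blocks, weight) + bias[None, :, None, None]
    # Flatten the two spatial dimensions into one: (B, embed_dim, gh * gw).
    flat = feat.reshape(b, feat.shape[1], gh * gw)
    # Swap the spatial and channel dimensions: (B, gh * gw, embed_dim).
    tokens = flat.transpose(0, 2, 1)
    return feat, tokens

rng = np.random.default_rng(0)
image = rng.standard_normal((1, 3, 224, 224)).astype(np.float32)
weight = rng.standard_normal((96, 3, 4, 4)).astype(np.float32)  # random stand-in weights
bias = np.zeros(96, dtype=np.float32)
feat, tokens = patch_embed(image, weight, bias)
print(feat.shape, tokens.shape)  # (1, 96, 56, 56) (1, 3136, 96)
```

Each of the 3136 tokens corresponds to one 4 × 4 block of the input, so token 0 depends only on the top-left block of the image.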

In Figure 1, STB represents the Swin transformer block, where STB1, STB2, STB3, and STB4 include one pair, one pair, three pairs, and one pair of W-MSA and SW-MSA combinations, respectively, and iterate on this basis. The extraction and fusion of information occur after the patch merging layers between STB2, STB3, and STB4, resulting in STB2P and STB3P. The letter 'P' in STB2P and STB3P denotes the output after the patch merging operation. Specifically, STB2P refers to the output after the patch merging that occurs between STB2 and STB3, while STB3P represents the output after the patch merging that occurs between STB3 and STB4. Layer normalization is a normalization technique in deep learning that is similar to batch normalization (BN), but it computes the mean and variance over the features of each sample in a layer rather than over the batch for each neuron, making it more suitable for sequence models. After layer normalization of STB4, STB4L is obtained. Together with STB4's output (STB4O) and the earlier STB2P and STB3P, the outputs of these four levels undergo data dimension transformation, adaptive average pooling, and mean operations before being merged. After passing through the split module, a fusion output is obtained. This result is added to STB4L, and adaptive average pooling is performed again to further reduce the spatial features and obtain the final output.
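Two of the building blocks named in this pipeline, layer normalization and adaptive average pooling, can be sketched in NumPy as follows. This is an illustration of the generic operations only: the learnable scale and shift of layer normalization are omitted, the bin edges follow the usual floor/ceiling convention for adaptive pooling, and the tensor shapes used in Swin-Fer's fusion path are not given in the text, so none are assumed here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature (last) axis, independently of the batch."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaptive_avg_pool1d(x, out_len):
    """Average-pool the last axis of x into out_len bins, whatever its input length."""
    in_len = x.shape[-1]
    bins = []
    for i in range(out_len):
        start = (i * in_len) // out_len
        end = -(-((i + 1) * in_len) // out_len)  # ceiling division
        bins.append(x[..., start:end].mean(axis=-1))
    return np.stack(bins, axis=-1)
```

Because the output length of the pooling is fixed by `out_len`, features from levels with different sizes can be brought to a common length before they are merged.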

Patch Merging
Patch merging is a technique employed in transformer models to enhance the efficiency of image processing. It involves dividing the input image into several sub-images, which are then concatenated and fed into the transformer model for processing, as illustrated in Figure 2. This approach reduces the computational load of the model by avoiding processing the entire large-scale image directly, thus improving efficiency when dealing with large images. During this process, patch merging compresses the detailed information of the input high-resolution feature map into a low-resolution feature map. While retaining the main information from the original feature map, this operation reduces computational complexity, thereby enhancing the computational efficiency and generalization capability of the model [7,15].
The patch merging process between two STBs is equivalent to a downsampling operation without convolution. Previous experiments tried to extract features before patch merging, resulting in two issues: excessive parameters and irregular data, leading to suboptimal feature extraction. Therefore, all experiments in this study executed feature data extraction after patch merging.
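A minimal NumPy sketch of patch merging is given below, following the design of the original Swin transformer [7]: the four pixels of every 2 × 2 neighborhood are concatenated along the channel axis and then linearly reduced from 4C to 2C channels. The layer normalization that usually precedes the reduction is omitted, and the argument name `reduction` is illustrative.

```python
import numpy as np

def patch_merging(x, reduction):
    """x: (B, H, W, C) with even H and W; reduction: (4C, 2C) projection matrix."""
    x0 = x[:, 0::2, 0::2, :]  # top-left pixel of every 2 x 2 neighborhood
    x1 = x[:, 1::2, 0::2, :]  # bottom-left
    x2 = x[:, 0::2, 1::2, :]  # top-right
    x3 = x[:, 1::2, 1::2, :]  # bottom-right
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)  # (B, H/2, W/2, 4C)
    return merged @ reduction                            # (B, H/2, W/2, 2C)
```

The spatial resolution is halved in each direction by slicing alone, which is why the operation acts as downsampling without a convolution.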

Dimensional Transformation of Data
Within each STB, all odd blocks have a shift size of 0, while all even blocks have a shift size of 3. A shift size of 0 implies no operation, and the spatial features are directly normalized before being unfolded. When the shift size is 3, dimensions 1 and 2 are shifted and the shifted-out content is filled in circularly, which is the roll operation (torch.roll). The purpose of rolling is to disrupt the internal structure of the data: the feature map keeps the same values but their positions change, which breaks the dependencies and linear relationships between pixels and facilitates more complex interaction between grids. Regardless of whether the block is odd or even, the output feature dimensions remain the same. The output window features (X_window) need to be merged again, which is equivalent to recombining the spatial features. With window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA), interactions within and between windows are achieved, further improving model performance.
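The circular shift can be illustrated with NumPy's `np.roll`, which behaves like `torch.roll` for this purpose. The sketch below assumes a (B, H, W, C) layout, a shift toward the top-left as in the original Swin transformer code, and a window size of 7; the window size is an assumption, since only the shift size of 3 is stated above.

```python
import numpy as np

def cyclic_shift(x, shift_size):
    """Roll a (B, H, W, C) feature map by -shift_size along H and W, wrapping around."""
    if shift_size == 0:
        return x  # odd blocks: no shift
    return np.roll(x, shift=(-shift_size, -shift_size), axis=(1, 2))

def window_partition(x, window_size):
    """Split (B, H, W, C) into non-overlapping windows: (B * num_windows, ws, ws, C)."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // window_size, window_size, w // window_size, window_size, c)
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, c)

feature = np.arange(14 * 14, dtype=float).reshape(1, 14, 14, 1)
shifted = cyclic_shift(feature, 3)
windows = window_partition(shifted, 7)  # window size 7 is an assumed value
```

The shifted map contains exactly the same values as the original, only at different positions, so rolling back by the opposite amount restores the input.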
In addition, the multi-head self-attention mechanism divides the input data into several heads, and each head generates different query, key, and value vectors and computes the corresponding attention-weighted results. These results are then concatenated to form the final output of the multi-head self-attention mechanism. This output is a three-dimensional tensor that needs to be reshaped to a four-dimensional tensor for better matching with subsequent feature fusion calculations. For example, position encoding needs to add position information to input features, and decoder calculation needs to concatenate the encoder results with target sequences. Therefore, converting the three-dimensional tensor to a four-dimensional tensor facilitates these calculations, improving model efficiency and accuracy.
Assume that the input is $X \in \mathbb{R}^{B \times L \times D}$, where $B$ represents the batch size, $L$ denotes the sequence length, and $D$ represents the input dimensionality. In the multi-head self-attention mechanism, the input is divided into $H$ parts through linear mapping, and after the multi-head attention calculation, an output $Y \in \mathbb{R}^{B \times L \times H}$ is obtained.
Initially, $H$ is divided into num_heads parts, and the input dimension, $D$, is also divided into the same number of parts, that is, $D = \mathrm{num\_heads} \times \mathrm{head\_dim}$, where head_dim represents the dimension size of each head. The output tensor, $Y$, of the multi-head self-attention mechanism is transformed into a four-dimensional tensor, $Z \in \mathbb{R}^{B \times L \times \mathrm{num\_heads} \times \mathrm{head\_dim}}$, as follows:

$$Z = \mathrm{reshape}\left(Y, \left[B, L, \mathrm{num\_heads}, \mathrm{head\_dim}\right]\right)$$

This operation segments the last dimension of the output tensor, creating a new dimension for each head in the new shape; it therefore requires the last dimension of $Y$ to have size $\mathrm{num\_heads} \times \mathrm{head\_dim}$. The transformation from three-dimensional to four-dimensional data effectively opens up spatial features before feature fusion, so that the whole network model can perceive more spatial dimension information, thus enhancing classification performance.
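The three-dimensional to four-dimensional conversion amounts to a single reshape of the last axis. The NumPy sketch below shows it under the assumption that the last dimension is divisible by the number of heads; `split_heads` is an illustrative name.

```python
import numpy as np

def split_heads(y, num_heads):
    """Reshape (B, L, D) into (B, L, num_heads, head_dim) with D = num_heads * head_dim."""
    b, l, d = y.shape
    if d % num_heads != 0:
        raise ValueError("last dimension must be divisible by num_heads")
    head_dim = d // num_heads
    return y.reshape(b, l, num_heads, head_dim)

y = np.arange(2 * 5 * 12).reshape(2, 5, 12)
z = split_heads(y, num_heads=3)  # shape (2, 5, 3, 4)
```

No values are moved or recomputed: element `z[b, l, h, d]` is the same number as `y[b, l, h * head_dim + d]`.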

Mean Module
In transformer networks, the mean operation is usually used to reduce the dimensionality of feature tensors extracted through self-attention mechanisms, enabling them to be input into fully connected layers for classification. For instance, in the vision transformer proposed by Dosovitskiy et al. [16], a feature tensor is divided into multiple blocks, followed by self-attention mechanisms, resulting in feature vectors for each block. Subsequently, each feature vector undergoes a mean operation to obtain a fixed-length feature vector for final classifier input.
In the Swin-Fer network, images are segmented into blocks, and features are extracted by self-attention mechanisms; thus, a sequence of features is generated. The mean module extracts the average information of the data, achieving dimensionality reduction and reducing computational complexity while retaining the main information.
Assume that the size of the obtained feature tensor is $H \times W \times n$, where $H$ represents height, $W$ represents width, and $n$ represents the number of channels [17]. The calculation formula for the mean module is as follows:

$$\mathrm{mean}(X)_{i,j} = \frac{1}{n} \sum_{k=1}^{n} X_{i,j,k}$$

where $\mathrm{mean}(X)_{i,j}$ represents the pixel value of the pooled output feature map at position $(i, j)$, while $X_{i,j,k}$ represents the pixel value of the input feature map $X$ at position $(i, j, k)$. The formula averages along the third dimension (the channel dimension) of the feature tensor to obtain a new feature tensor. Since this operation aggregates the original $n$ channels into one channel, the number of channels changes from $n$ to 1. This average pooling combines the vectors without increasing the number of network parameters.
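As a small worked example, the channel-wise mean can be written in NumPy in one line. The function name `channel_mean` is illustrative, and the (H, W, n) layout follows the definition above.

```python
import numpy as np

def channel_mean(x):
    """Average an (H, W, n) feature tensor over its channel axis, giving (H, W, 1)."""
    return x.mean(axis=2, keepdims=True)

x = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
pooled = channel_mean(x)
print(pooled[..., 0])
# [[ 1.  4.]
#  [ 7. 10.]]
```

The operation has no weights or biases, so it reduces the data from n channels to one without adding parameters.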

Split Module
Swin-Fer focuses on controlling the number of parameters while performing facial expression recognition tasks. With data dimensions already high, it is essential to minimize weight or bias operations as much as possible. As shown in Figure 1, Branch1 and Branch2 each contain a group convolution. Because the data processing at this stage is not very extensive, group convolution can be used instead of regular convolution to reduce parameters, especially for small datasets. As shown in Figure 3, the group convolution strategy first divides the input feature map into several groups, with an equal number of channels in each group. Then, convolution operations are performed separately on each group, and, finally, the convolution results of each group are concatenated as the final output. Through this channel grouping strategy, different groups can learn different features, enhancing feature diversity. Additionally, group convolution enables better learning of inter-channel relationships, strengthening model representation capability. Specifically:

$$X' = \Big\Vert_{g=1}^{G} \left( K^{(g)} * X^{(g)} + b \right)$$

where $X^{(g)}$ represents the $g$-th group of the input feature map, with a total of $G$ groups; $X'$ is the output feature map after group convolution; "||" denotes the tensor concatenation operation; $K^{(g)}$ represents the convolution kernel of the $g$-th group; and $b$ is the bias term.
The results of these two branches are concatenated. Subsequently, the tensor dimension transposition operation is performed, where the input is [B, 2, 2, 4, 4] and the output is [B, 2, 2, 4, 4]. Although the second and third dimensions are both 2, they actually switch positions (as shown in Figure 4). The transposed tensor allows concatenation along the batch processing dimension and better utilization of hardware parallelism to speed up calculation. Generally, after tensor dimension transposition, position information is restored. However, since the data volume at this stage of the experiment is already small, the impact on experimental results is negligible, and the restoration operation is not executed.
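The parameter saving of group convolution and the dimension swap can be sketched in NumPy as follows. To stay short, the sketch uses a 1 × 1 kernel and one bias per output channel, which is a simplification; the kernel size, channel counts, and group count used in Swin-Fer's branches are not stated above, so the numbers below are illustrative only.

```python
import numpy as np

def conv_param_count(c_in, c_out, kernel, groups=1):
    """Weights plus biases of a 2D convolution with the given number of groups."""
    return (c_in // groups) * c_out * kernel * kernel + c_out

def group_conv1x1(x, kernels, bias):
    """x: (B, C_in, H, W); kernels: (G, C_out/G, C_in/G); bias: (C_out,)."""
    groups = kernels.shape[0]
    chunks = np.split(x, groups, axis=1)  # the G input groups
    outs = [np.einsum("oc,bchw->bohw", kernels[g], chunks[g]) for g in range(groups)]
    # Concatenate the per-group results along the channel axis, then add the bias.
    return np.concatenate(outs, axis=1) + bias[None, :, None, None]

def swap_dims(t):
    """Exchange the second and third dimensions of a five-dimensional tensor."""
    return t.transpose(0, 2, 1, 3, 4)

print(conv_param_count(96, 96, 3))            # 83040
print(conv_param_count(96, 96, 3, groups=4))  # 20832
```

With G groups, each kernel sees only 1/G of the input channels, so the weight count drops by a factor of G. The swap leaves the shape [B, 2, 2, 4, 4] unchanged while reordering the contents.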

Experimental Datasets
The Cohn-Kanade dataset (CK+) is a laboratory-environment dataset; Figure 5 shows sample expressions for each category. For the task of facial expression recognition in static images, the last three frames of the expression sequence [18], known as the peak expression states, are usually selected for training and testing. In this paper, 981 images are selected as the experimental dataset, comprising 882 images for training and 99 images for testing. The sample distribution of each category is detailed in Table 1.
Fer2013, a representative dataset for facial expressions in natural environments, includes facial occlusions (e.g., hands, hats, and glasses), low pixel resolutions, and facial images captured with arbitrary poses and angles. After preprocessing, the samples in the Fer2013 dataset were scaled to 48 × 48 pixels. Figure 6 illustrates sample images for each category of this challenging dataset, and Table 2 outlines the sample distribution for each category.

Experimental Environment
The experimental environment for Swin-Fer is based on Windows 10, with a 12th Gen Intel(R) Core(TM) i7 3.6 GHz CPU and an NVIDIA GeForce RTX 4090 GPU, within a Python development environment. During model training, the Adam optimizer, an adaptive learning rate optimization algorithm, is adopted; it dynamically adjusts the learning rate based on gradient information in each iteration, thus achieving faster and more stable convergence. All experiments in the manuscript were conducted using the same environmental settings and hyperparameter configurations. The specific experimental parameters are presented in Table 3.
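How Adam adapts the step size from gradient information can be seen in a minimal scalar sketch of its update rule. The hyperparameters below are the commonly used defaults and are assumptions; the settings actually used for Swin-Fer are those listed in Table 3, which is not reproduced here.

```python
import math

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; state holds t, m, and v."""
    state["t"] += 1
    # Exponential moving averages of the gradient and of its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # The step is scaled by the running gradient magnitude.
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy usage: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
state = {"t": 0, "m": 0.0, "v": 0.0}
x = 0.0
for _ in range(2000):
    x = adam_step(x, 2.0 * (x - 3.0), state, lr=0.05)
```

Because the update divides by the square root of the averaged squared gradient, the first step has a magnitude close to `lr` regardless of how large the raw gradient is.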

Experimental Results
Applying a transformer to the FER2013 dataset for facial expression recognition is challenging because the original images are only 48 × 48 pixels, while most transformer models are suitable for sizes of 224 and 356. The Swin transformer requires a large image size for operations such as patch embedding, which uses a step size of 4. Therefore, to obtain the transformer's benchmark, the facial expression images had to be enlarged to 224 × 224. We tested four lightweight transformer structures, with specific experimental results presented in Table 4. With almost no change in the parameter count, the recognition accuracy of our method reached 71.11% on FER2013, an increase of 0.41% compared to the original Swin transformer, and 100% on CK+, a 3.75% improvement with respect to the original network. To expand the practicality of transformers, the same transformer architecture is usually offered in several sizes, such as tiny, small, base, and large. Increasing the model size typically improves accuracy, but there is also a corresponding exponential increase in parameters between different models. The transformer models selected in Table 4 are all lightweight structures, such as small and tiny-level structures, and Swin-Fer emphasizes the use of methods like mean, adaptive pooling, and group convolution to control the number of parameters [19]. As shown in Table 5, while applying a fusion strategy to better extract features, the total parameters of the proposed Swin-Fer network increased by only 70 B compared to the original Swin transformer. The experimental results for Swin-Fer and the original Swin transformer on the FER2013 dataset are shown in Figure 7.
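The enlargement from 48 × 48 to 224 × 224 can be sketched with a nearest-neighbor resize in NumPy. The interpolation method is not stated above, so nearest-neighbor is an assumption chosen for simplicity, and `resize_nearest` is an illustrative helper rather than a library function.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize of an (H, W) or (H, W, C) array."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h  # source row for every output row
    cols = np.arange(out_w) * in_w // out_w  # source column for every output column
    return img[rows[:, None], cols[None, :]]

face = np.arange(48 * 48, dtype=np.float32).reshape(48, 48)
enlarged = resize_nearest(face, 224, 224)
patches_per_side = enlarged.shape[0] // 4  # stride-4 patch embedding
print(enlarged.shape, patches_per_side)  # (224, 224) 56
```

At the original 48 × 48 size, a stride of 4 would leave only 12 patches per side, which explains why the images are enlarged before patch embedding.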
From Figure 7a, it is evident that, owing to the use of pre-training, both models converged quickly, stabilizing at around 40 iterations. Because facial expression recognition is a multi-class classification problem, we adopted the One-vs.-Rest strategy to generate the ROC curves. As shown in Figure 7b, the area under the ROC curve for both models is 0.84, with the curves deviating clearly from the 45-degree diagonal line, indicating good sensitivity and specificity. However, Figure 7c shows that the Swin-Fer method has relatively smaller fluctuations in accuracy and is therefore more stable. Finally, from the confusion matrices presented in Figure 7d,e, the accuracy changes per class between Swin-Fer and the original Swin transformer are as follows: anger (+0.2%), disgust (+3.6%), fear (+0.7%), happiness (−1.7%), neutral (+1.4%), sadness (+1%), and surprise (+1.9%). Accuracy increased for six out of seven classes, demonstrating the effectiveness of the proposed improvements.
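The One-vs.-Rest strategy treats each class in turn as the positive class and all remaining classes as negative. A minimal NumPy sketch is shown below; it computes the AUC through the pairwise-ranking formulation instead of integrating the ROC curve, and how the per-class values are combined into the single reported figure is not stated above, so no averaging is assumed.

```python
import numpy as np

def binary_auc(labels, scores):
    """AUC as the probability that a positive sample outranks a negative one (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need both positive and negative samples")
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def one_vs_rest_auc(y_true, probs):
    """Per-class AUC: class k is positive, all other classes are negative."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    return [binary_auc(y_true == k, probs[:, k]) for k in range(probs.shape[1])]

print(binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to the 45-degree diagonal, and values closer to 1 indicate that the class scores separate positives from negatives more reliably.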

Comparison of Experimental Accuracy
The accuracy of the Swin-Fer network structure proposed in this paper, along with the accuracies of other methods introduced in the past five years on the FER2013 and CK+ datasets, is compared in Table 6. Even representative methods on FER2013 reach only slightly more than 70% accuracy, which shows that facial expression recognition in natural environments is challenging. Adjusting network structures and optimizing hyperparameters can help capture complex features in images, thus further improving model performance. This paper has explored using transformer methods to handle facial expression recognition tasks, focusing on the design of the Swin-Fer network model. With the network parameters mostly unchanged, competitive experimental results were achieved, with an accuracy of 71.11%. In addition, the third column of Table 6 compares the accuracies of different network structures on the CK+ dataset. These accuracies are for the most part very high, with the lowest being 96.25%. The method proposed in this paper, Swin-Fer, achieved 100%. These experimental results enhance confidence in the generalization ability of the model.

Table 6. Accuracy of different networks on the Fer2013 and CK+ datasets.

| Network | Fer2013 | CK+ |
| --- | --- | --- |
| CNN using the Adamax optimizer [20] | 66% | - |
| VGG16+SE Block [21] | 66.8% | 99.18% |
| DeepEmotion [22] | 70.02% | 98% |
| Swin transformer | 70.70% | 96.25% |
| Improved CNN based on center loss [23] | 71.39% | 96.64% |
| SSER [24] | 71.62% | 97.59% |
| Multi-Channel Attention Residual Network [25] | 72.7% | 98.8% |
| ResNet-50 + pyramid + cascaded attention block + GRU [26] | - | 99.23% |
| ViT+SE [27] | - | 99.8% |
| Swin-Fer | 71.11% | 100% |

To further compare the effectiveness of the proposed model, we conducted training on a larger dataset, AffectNet. AffectNet is a large-scale facial expression dataset designed for training and evaluating facial expression recognition models. It contains over one million facial images collected from the Internet, with approximately 45,000 manually annotated images for eight emotion categories. The images in the AffectNet dataset typically have a resolution of 256 × 256 pixels. These images cover a wide range of facial expressions and poses, making the dataset suitable for research on emotion analysis and facial expression recognition. This resolution is appropriate for training and testing deep learning models, particularly those that require high-resolution inputs, such as transformer models. Table 7 shows the comparison of experimental accuracy across different methods on the AffectNet (eight emotions) dataset.

Table 7. Accuracy of different networks on the AffectNet (eight emotions) dataset.

| Network | AffectNet (Eight Emotions) |
| --- | --- |
| EfficientFace [28] | 59.89% |
| SL + SSL puzzling (B2) [29] | 61.32% |
| Multi-Task EfficientNet-B0 [30] | 61.32% |
| DAN [31] | 62.09% |
| CAGE [32] | 62.30% |
| Vit-base + MAE [33] | 62.42% |
| Swin-Fer | 63.29% |

As indicated in Tables 6 and 7, although the proposed method did not achieve the highest accuracy on the FER2013 dataset, it demonstrated superior performance on the AffectNet (eight emotions) dataset compared to other advanced models. The larger resolution and color images of AffectNet likely better suit the capabilities of Swin-Fer, highlighting its strength in extracting feature information from high-resolution color images. This suggests that Swin-Fer, based on the Swin transformer for feature extraction, performs well in settings where input images are of larger spatial dimensions, allowing the model to extract more effective feature information.

Conclusions
Facial expression recognition is a challenging computer vision task. The key regions of different expressions are distributed differently across the face, and under natural conditions, variations in head posture, lighting, and occlusion make feature extraction particularly difficult [34,35]. To enhance the model's generalization, Swin-Fer incorporates a split module, replaces ordinary convolutions with group convolutions in two branches, and controls the number of parameters through strategies such as averaging and adaptive pooling.
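The parameter saving from group convolution can be illustrated with a short worked example. The Python sketch below counts the parameters of a 2D convolution with and without grouping. The channel width (96), kernel size (3), and group count (4) are assumptions chosen for illustration; they are not the configuration reported for Swin-Fer.

```python
def conv2d_params(in_ch, out_ch, kernel, groups=1, bias=True):
    """Count the learnable parameters of a 2D convolution layer.

    With `groups` groups, each output channel is connected to only
    in_ch // groups input channels, so the weight count is divided
    by `groups` while the bias count is unchanged.
    """
    if in_ch % groups != 0 or out_ch % groups != 0:
        raise ValueError("in_ch and out_ch must both be divisible by groups")
    weights = out_ch * (in_ch // groups) * kernel * kernel
    return weights + (out_ch if bias else 0)


# Hypothetical layer sizes, for illustration only.
standard = conv2d_params(96, 96, 3)           # ordinary convolution
grouped = conv2d_params(96, 96, 3, groups=4)  # group convolution, 4 groups

print(standard, grouped)
```

Under these assumed sizes, the ordinary layer has 83,040 parameters and the grouped layer 20,832, so the weight tensor shrinks by a factor equal to the number of groups.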
According to the experimental results, the proposed method effectively focuses on the most critical regions of facial expressions and achieves competitive results on datasets from both natural conditions (FER2013) and laboratory environments (CK+). Because the images in the FER2013 dataset are of low resolution, the performance of transformer models on it is limited, and accuracy needs to be improved further. On the AffectNet dataset, which consists of higher-resolution color images, performance was notably better, demonstrating the advantages of Swin-Fer. Enhancing the performance of transformer models on low-resolution facial expression images remains a key focus for future research. Additionally, the current research is limited to single-person facial expression recognition in static images. Future research should cover a broader range of practical scenarios, such as real-time, multi-person, and dynamic video facial expression recognition, and provide powerful support tools for practical applications.

Figure 8 presents the experimental results for Swin-Fer and the original Swin transformer network on the CK+ dataset. Because the CK+ dataset has a limited sample size, the model may be constrained by insufficient data during training and validation, resulting in fluctuations in accuracy. The proposed Swin-Fer method, which builds on fusion methods and introduces split modules and data dimension transformation strategies, enhances the model's ability to capture facial expression details. The experimental results show that, compared to the original Swin transformer model, Swin-Fer has a smoother loss curve on the CK+ dataset, indicating more stable training. Additionally, the accuracy increased by 3.75 percentage points (from 96.25% to 100%), which shows that Swin-Fer also has potential and advantages in facial expression recognition tasks in laboratory environments.

Appl. Sci. 2024, 14, 6125

Table 1. Sample distribution of the CK+ dataset.

Table 2. Sample distribution of the FER2013 dataset.

Table 4. Accuracy of transformer methods on the FER2013 and CK+ datasets.

Table 5. Comparison of network parameter counts for transformer methods.

Table 6. Comparison of experimental accuracy of different methods on FER2013 and CK+ datasets.

Table 7. Comparison of experimental accuracy of different methods on the AffectNet (eight emotions) dataset.