Fine-Tuning Swin Transformer and Multiple Weights Optimality-Seeking for Facial Expression Recognition

Facial expression recognition plays a key role in human-computer emotional interaction. However, human faces in real environments are affected by various unfavorable factors, which reduce expression recognition accuracy. In this paper, we propose a novel method that combines Fine-tuning Swin Transformer and Multiple Weights Optimality-seeking (FST-MWOS) to enhance expression recognition performance. FST-MWOS mainly consists of two crucial components: Fine-tuning Swin Transformer (FST) and Multiple Weights Optimality-seeking (MWOS). FST takes Swin Transformer Large as the backbone network and obtains multiple groups of fine-tuned model weights for homologous data domains by varying hyperparameter configurations, data augmentation methods, etc. MWOS uses a greedy strategy to mine locally optimal generalizations within the optimal epoch interval of each group of fine-tuned model weights, and then seeks the optimum over the multiple groups of locally optimal weights to obtain a globally optimal solution. Experimental results on the RAF-DB, FERPlus and AffectNet datasets show that the proposed FST-MWOS method outperforms various state-of-the-art methods.


I. INTRODUCTION
Facial expression is one of the most natural, powerful and universal signals for human beings to express their emotional states and intentions [1]. Facial expression recognition technology has a wide range of applications in the field of human-computer interaction, such as social robots, medical diagnosis and fatigue monitoring [2]. With the increasing number of people living alone, how to provide them with emotional comfort has become a key concern for society [3]. Many researchers have focused on emotionally interactive robots [4], [5], [6]. However, in the process of real human-computer interaction, the human face is usually affected by various interference factors, which undoubtedly increases the difficulty of expression recognition. As facial expression recognition methods have been intensively studied, many researchers have introduced attention mechanisms to perceive occlusion and posture changes [7], [8], and designed methods to suppress label annotation ambiguity [9], [10]. In addition, the Visual Transformer (ViT) has also been applied to facial expression recognition [11], [12] to enhance the correlation between detailed features and achieve the most advanced facial expression recognition performance. However, recognition performance using only the best-performing individual model is approaching a bottleneck. Meanwhile, the remaining sub-optimal models, which took considerable time and resources to obtain, are left unexploited.
(The associate editor coordinating the review of this manuscript and approving it for publication was Syed Islam.)
In order to tap the value of multiple groups of models and effectively improve the accuracy of facial expression recognition, we propose a method that combines Fine-tuning Swin Transformer and Multiple Weights Optimality-seeking (FST-MWOS). FST-MWOS mainly consists of two crucial components: Fine-tuning Swin Transformer (FST) and Multiple Weights Optimality-seeking (MWOS). These two key components complement each other to maximize facial expression recognition performance. First, Swin Transformer Large (Swin-L) is used as the backbone network to extract basic facial features while enhancing the correlation between feature sequences. Then, the classification head is adapted to the expression recognition task. Next, multiple groups of model weights for the homologous data domains are obtained by fine-tuning hyperparameters and data augmentation. Finally, MWOS employs a local greedy strategy and a global greedy strategy to filter out invalid weight groups and to mine the local-global optimal recognition performance.
Overall, our main contributions are summarized as follows:
• A novel FST-MWOS method is proposed to perform FER. FST-MWOS improves the accuracy of expression recognition by finding the optimal extreme points among multiple fine-tuned model weights of homologous data domains.
• We designed a simple but effective multiple-weights optimality-seeking component named MWOS, which combines a local greedy strategy and a global greedy strategy to select the model-weight information that plays a positive role in recognition performance.
• Our FST-MWOS method is extensively evaluated on three public FER datasets. FST-MWOS achieves 90.38%, 90.41% and 63.33% recognition accuracy on RAF-DB, FERPlus and AffectNet respectively, outperforming a variety of state-of-the-art methods.

II. RELATED WORK
A. FACIAL EXPRESSION RECOGNITION
With the development of deep learning, learning-based methods [13], [14], [15] have replaced traditional handcrafted methods [16], [17], [18] thanks to their powerful feature extraction capabilities. However, facial images are subject to multiple uncertainties in the wild, and it is difficult to extract discriminative features using a CNN alone. For this reason, some researchers have applied attention mechanisms to the network. Li et al. [7] proposed a CNN with attention mechanism (ACNN) that can perceive the occluded regions of the face and focus on the most discriminative un-occluded regions. Wang et al. [8] proposed a Region Attention Network (RAN) to adaptively capture the importance of facial regions under occlusion and pose variation. In addition to addressing facial occlusion and posture variation, several methods work to suppress the label annotation ambiguity caused by subjective annotations and the inter-class similarity of facial expressions. Wang et al. [9] proposed a Self-Cure Network (SCN) with rank regularization and lowest-ranked group relabeling, which suppresses these uncertainties efficiently. Ruan et al. [10] proposed a Feature Decomposition and Reconstruction Learning method (FDRL), which decomposes the basic features to perceive latent features and reconstructs similar expression features by capturing intra-feature and inter-feature relationships among the latent features. With the rise of the Vision Transformer in facial expression recognition tasks, Huang et al. [11] combined a grid-wise attention mechanism and a Visual Transformer to learn feature dependencies and global representations. Ma et al. [12] proposed Visual Transformers with Feature Fusion (VTFF), which uses a multi-layer Transformer encoder to strengthen the correlation between multi-branch fusion features, effectively improving the accuracy of expression recognition. Li et al. [19] designed a pure transformer-based mask vision transformer (MVT) to filter out complex backgrounds and occlusions in face images and to correct incorrect expression labels. However, current mainstream learning-based FER methods focus only on picking the individual model that performs best on a held-out validation set, ignoring the correlation between multiple groups of fine-tuned weights.

B. TRANSFORMER IN COMPUTER VISION
Transformer [20] holds a dominant position in the NLP field thanks to its superior sequence modeling capability and global information perception, and some researchers have tried to apply it to computer vision tasks. Dosovitskiy et al. [21] segmented the input images into multiple patches and flattened them into 1D patch embeddings, which were fed to a Transformer encoder; this was the first time the Transformer was applied to a computer vision task. He et al. [22] demonstrated the great promise of the Vision Transformer by using ViT as an encoder to improve training speed and recognition performance. Researchers continued to explore deeper and obtained many variants of ViT [23], [24], [25]. Liu et al. [26] proposed the Swin Transformer, whose shifted window scheme allows flexibility in modeling image features at various scales and enhances long-term dependencies between feature sequences. Our FST component fine-tunes the Swin Transformer to make it suitable for facial expression recognition in human-computer dialogue scenarios.

C. WEIGHTS OPTIMALITY
Several researchers have worked on designing weight optimization methods to maximize model accuracy; such methods are widely used in various tasks because they bring no extra burden for inference or memory. Izmailov et al. [27] proposed a Stochastic Weight Averaging (SWA) method, which simply averages multiple points along the trajectory of gradient descent. Cha et al. [28] used a denser random weight sampling strategy to suppress the overfitting phenomenon present in SWA. However, all the above methods only studied the weights of individual models. Wortsman et al. [29] further improved model performance by obtaining multiple groups of fine-tuned model weights with different hyperparameter configurations and mining the optimal extremes with a greedy averaging strategy. Inspired by this idea, we hypothesize that it is feasible and effective to improve expression recognition performance by a greedy strategy from a local-global perspective.

III. OUR METHOD
A. OVERVIEW
An overview of the proposed method is shown in Fig 1. Given training samples from a facial expression recognition database D = {(X_i, y_i)}, where X_i is the input image of size 3 × H × W, and y_i ∈ {1, . . . , N} is its corresponding label for an N-class expression recognition problem. First, let the deep dependency feature map X*_i ∈ R^(C_out × (H·W)) be the output of a Swin-L network pre-trained on the ImageNet22K database, where C_out = 1536 is the number of output channels. Then, a classification head (containing an average pooling layer and a fully connected layer) takes X*_i as input and maps it to a raw score vector x_i ∈ R^(N×1). The probability distribution of the N-class expression recognition P(y = p | x_i) is then calculated using the softmax function. At the same time, the model weights W_i = {W_i1, . . . , W_iep} for each epoch of the training process are saved, where ep is the total number of training epochs. Fine-tuning the hyperparameter configurations S = {S_1, . . . , S_n} and repeating the above process yields n groups of model weights W = {W_1, . . . , W_n} for the homologous data domains but with different recognition performance. Finally, a local-global greedy strategy is used to mine the best-performing weights W_best among the multiple groups of fine-tuned model weights, which will be introduced in Section III-B in detail.
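The classification-head step above can be sketched as follows; the Swin-L backbone is abstracted away as a precomputed C_out × (H·W) feature map, and the shapes and random inputs are purely illustrative:

```python
import numpy as np

C_OUT, N_TOKENS, N_CLASSES = 1536, 49, 7  # e.g. 7 basic expressions (RAF-DB)

def classification_head(feature_map, weight, bias):
    """Average-pool the spatial tokens, apply a fully connected layer, softmax."""
    pooled = feature_map.mean(axis=1)          # (C_OUT, H*W) -> (C_OUT,)
    scores = weight @ pooled + bias            # raw score vector x_i, shape (N_CLASSES,)
    exp = np.exp(scores - scores.max())        # numerically stable softmax
    return exp / exp.sum()                     # P(y = p | x_i)

rng = np.random.default_rng(0)
probs = classification_head(
    rng.normal(size=(C_OUT, N_TOKENS)),        # stand-in for the feature map X*_i
    rng.normal(size=(N_CLASSES, C_OUT)) * 0.01,
    np.zeros(N_CLASSES),
)
```

The output is a valid probability distribution over the N expression classes.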

B. MULTIPLE WEIGHTS OPTIMALITY-SEEKING
The current mainstream FER methods pick only the individual model with the best performance and ignore the value of the remaining model weights. Thus, we designed the MWOS component, which consists of a local greedy strategy and a global greedy strategy, to mine the best performance from the multiple groups of fine-tuned model weights with no extra burden for inference or memory.

1) LOCAL GREEDY STRATEGY (LGS)
LGS seeks a weight-optimal generalization within the best epoch interval of an individual model, as can be seen in Fig 2. Given a group of model weights W_mBEI = {W_m(best−k), . . . , W_mbest, . . . , W_m(best+k)}, where W_mbest indicates the model weights of the best-performing epoch of the training process and [best − k, best + k] denotes the upper and lower limits of the interval (k is set to 4 in this paper). With equation (1), we compute W_mg as the group of local greedy weights, that is:

W_mg = GreedyStrategy(W_mBEI), (1)

where the implementation steps of the Greedy Strategy are presented in Algorithm 1:
1) Sort the group of model weights in decreasing order of performance on the validation set;
2) Initialize the greedy model weights W_g = {} and the best validation performance Best_val = 0;
3) Add weights to W_g in order and compare against Best_val;
4) Retain only the model weights that yield a positive improvement in recognition performance.
Based on equation (2), we use an average over W_mg to determine the local optimal generalization W_m:

W_m = (1/g) · Σ_{W ∈ W_mg} W, (2)

where g is the number of model weights retained in W_mg.
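The steps of Algorithm 1 can be sketched as below; `evaluate` is a hypothetical stand-in for measuring the validation-set accuracy of an averaged model, and the toy weight vectors only illustrate the selection behavior:

```python
import numpy as np

def greedy_strategy(weights, evaluate):
    """Algorithm 1: keep only weights whose addition improves the running average."""
    ordered = sorted(weights, key=lambda w: evaluate([w]), reverse=True)  # step 1
    kept, best_val = [], 0.0                                              # step 2
    for w in ordered:                                                     # step 3
        score = evaluate(kept + [w])
        if score > best_val:                                              # step 4
            kept, best_val = kept + [w], score
    return kept

def local_optimum(kept):
    """Equation (2): plain average of the g retained weight tensors."""
    return sum(kept) / len(kept)

# Toy example: "accuracy" improves as the averaged weights approach a target.
target = np.array([1.0, 2.0])
evaluate = lambda ws: 1.0 / (1.0 + np.linalg.norm(np.mean(ws, axis=0) - target))
interval = [np.array([1.1, 2.1]), np.array([0.9, 1.9]), np.array([5.0, 5.0])]
kept = greedy_strategy(interval, evaluate)
```

In this toy run, the first two weight vectors average exactly to the target and are kept, while the outlier degrades the average and is discarded.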

2) GLOBAL GREEDY STRATEGY (GGS)
GGS aims to improve recognition performance by screening out valuable information from multiple groups of local optimal generalizations. Given a set of hyperparameter and data augmentation configurations S = {S_1, . . . , S_n}, the Swin-L network is fine-tuned to obtain the groups of local greedy weights W = {W_1, . . . , W_n}, that is:

W_i = LGS(FineTune(W_0, S_i)), i = 1, . . . , n, (3)

where W_0 denotes the pre-trained initialization (pre-trained on the ImageNet22K database) and n is set to 15 in this paper. Next, the Greedy Strategy is used to find the group of global greedy weights W_g that has a positive effect on recognition performance, that is:

W_g = GreedyStrategy(W). (4)

Finally, averaging over W_g yields the local-global optimal extremum W_best, that is:

W_best = (1/h) · Σ_{W ∈ W_g} W, (5)

where h is the number of model weights retained in W_g.
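A self-contained sketch of the global stage follows, under the same assumptions as the LGS sketch (hypothetical `evaluate` function, toy vectors standing in for the n local optimal generalizations):

```python
import numpy as np

def greedy_select(candidates, evaluate):
    """Greedy selection across the n local optimal generalizations."""
    ordered = sorted(candidates, key=lambda w: evaluate([w]), reverse=True)
    kept, best_val = [], 0.0
    for w in ordered:
        score = evaluate(kept + [w])
        if score > best_val:
            kept, best_val = kept + [w], score
    return kept

def global_optimum(local_optima, evaluate):
    """Average the h retained groups to obtain the final weights W_best."""
    kept = greedy_select(local_optima, evaluate)
    return np.mean(kept, axis=0)

# Toy stand-ins for the local optima produced by n fine-tuning configurations.
target = np.array([1.1, 2.1])
evaluate = lambda ws: 1.0 / (1.0 + np.linalg.norm(np.mean(ws, axis=0) - target))
local_optima = [np.array([1.0, 2.0]), np.array([1.2, 2.2]), np.array([10.0, 10.0])]
w_best = global_optimum(local_optima, evaluate)
```

Because selection and averaging happen entirely at training time, a single merged weight set is deployed, adding no inference or memory cost.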

IV. EXPERIMENTS
In this section, we first describe three publicly available in-the-wild FER datasets (i.e., RAF-DB, FERPlus and AffectNet).

2) FERPlus
FERPlus [31] is extended from the original FER2013 dataset and contains 28,709 training images and 3,589 test images. Each facial image is labelled by voting among 10 crowdsourced annotators. For a fair comparison, the majority voting mode is used to obtain the annotation for each image (unknown and non-face images are removed). Examples of the samples are shown in Fig 3.
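The majority-voting mode can be illustrated as follows; the vote strings and the `invalid` categories are assumptions for illustration, not the dataset's exact label names:

```python
from collections import Counter

def majority_label(votes, invalid=("unknown", "non-face")):
    """Pick the most-voted label; discard images whose majority vote is invalid."""
    label, _ = Counter(votes).most_common(1)[0]
    return None if label in invalid else label

# e.g. 10 crowdsourced votes for one image
votes = ["happy"] * 6 + ["neutral"] * 3 + ["unknown"]
```

Images whose majority vote falls in an invalid category are dropped from the dataset.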

3) AffectNet
AffectNet [32] is currently the largest FER dataset, with more than 450,000 facial images collected from the Internet and annotated manually. In our experiment, we utilize the images annotated with eight basic expressions, providing 286,621 images for training and 4,000 images for testing. Examples of the samples are shown in

2) RANDOM ERASING
Random erasing [34] is a data augmentation method that effectively reduces the risk of over-fitting. It randomly selects a rectangular region in an image and erases its pixels with random values. In this process, training samples with different occlusion levels can be generated by adjusting the erasure values.
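A minimal sketch of the idea (simplified: a fixed-size rectangle with uniform random fill; the original method also randomizes the region's area and aspect ratio):

```python
import numpy as np

def random_erase(image, eh, ew, rng):
    """Erase a randomly placed eh x ew rectangle with random pixel values."""
    out = image.copy()
    c, h, w = out.shape                       # channels-first image tensor
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    out[:, top:top + eh, left:left + ew] = rng.uniform(0.0, 1.0, size=(c, eh, ew))
    return out

rng = np.random.default_rng(0)
img = np.zeros((3, 224, 224))
erased = random_erase(img, 32, 32, rng)
```

Varying the rectangle size or fill values simulates different occlusion levels, as described above.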

3) MIX UP
Mixup [35] is a simple and data-agnostic data augmentation method. It generates a new sample-label pair (X, y) by proportionally summing two sample-label pairs (X_i, y_i) and (X_j, y_j), increasing the generalization ability of the model.
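Mixup can be sketched in a few lines, with labels as one-hot vectors; in the original formulation the mixing ratio lambda is drawn from a Beta distribution, but it is fixed here for illustration:

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, lam):
    """Convex combination of two sample-label pairs."""
    x = lam * x_i + (1.0 - lam) * x_j
    y = lam * y_i + (1.0 - lam) * y_j
    return x, y

x, y = mixup(np.ones((3, 4, 4)), np.array([1.0, 0.0]),
             np.zeros((3, 4, 4)), np.array([0.0, 1.0]), lam=0.7)
```

The mixed label keeps the same proportions as the mixed images, so the model is trained on soft targets.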

C. IMPLEMENTATION DETAILS
For each dataset, all facial images are detected and cropped by MTCNN [33], and the cropped images are further resized to 256 × 256. At training time, the input facial images are randomly cropped to 224 × 224 and augmented by horizontal flips, random erasing and mixup. During the test process, we obtain input images of size 224 × 224 by center crop.
In our experiments, the FST-MWOS method is implemented with the PyTorch toolbox, and all experiments are conducted on a single NVIDIA RTX 3090Ti GPU card. We use the Swin-L network, pre-trained on the ImageNet22K dataset, as the backbone. The AdamW optimizer [36] is used to optimize the whole network with a batch size of 32, and the model is trained for 50 epochs on each of the three datasets. We use a linear learning rate warmup of 5 epochs and cosine learning rate decay. The label smoothing cross-entropy loss [37] is utilized to supervise the model to generalize well for expression recognition. Multiple groups of model weights are obtained by fine-tuning the initial learning rate, weight decay, and data augmentation; the fine-tuning parameter configurations are shown in Table 1.
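The learning-rate schedule described above can be sketched as a pure function of the epoch; the base LR of 1e-4 is an illustrative value, since the actual initial learning rates are among the fine-tuned parameters in Table 1:

```python
import math

def lr_at(epoch, base_lr=1e-4, warmup=5, total=50):
    """Linear warmup for the first `warmup` epochs, then cosine decay to ~0."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    progress = (epoch - warmup) / (total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(e) for e in range(50)]
```

The rate ramps up to the base value over 5 epochs, then decays smoothly toward zero by epoch 50.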

D. VISUALIZATION ANALYSIS
To further analyze our method, we conduct visualizations of the performance distribution and the confusion matrices.

1) DISTRIBUTION OF MULTIPLE WEIGHTS OPTIMALITY-SEEKING PERFORMANCE
As shown in Fig 4, we visualize the distribution of expression recognition performance on the RAF-DB, FERPlus and AffectNet datasets. Specifically, the figure depicts the performance over the best epoch interval (green diamonds) for the 15 groups of fine-tuned weights. The local greedy performance (blue circles) and the global greedy performance (red pentagram) are marked at the horizontal coordinate 'best'. We can observe that the global greedy recognition accuracies of 90.38%, 90.41% and 63.33% on the RAF-DB, FERPlus and AffectNet datasets are much higher than the performance of any individual model, which is most evident on RAF-DB. The results illustrate that our proposed method can effectively tap the correlation between the weights of the multiple groups of fine-tuned models and further enhance expression recognition performance.

E. COMPARISON WITH STATE-OF-THE-ART METHODS
We compare the FST-MWOS method with several state-of-the-art methods on the three public datasets in Table 2. Since gaCNN [7], DACL [38] and FDRL [10] did not report expression recognition accuracy on the FERPlus dataset, and FER-VT [11] and FDRL did not report specific accuracy on the AffectNet dataset, the corresponding entries are marked with '−'. Among all the competing methods, gaCNN and RAN [8] aim to disentangle disturbing factors in facial expression images, such as occlusion and posture changes. SCN [9], DACL, DMUE [39] and FDRL were proposed to solve the noisy label problem. VTFF [12], FER-VT and MVT [19] incorporate the Vision Transformer to improve recognition performance. These advanced methods improve FER performance by suppressing the influence of different disturbing factors or by incorporating ViT, but they focus only on the best-performing individual model.

1) COMPARISON RESULTS ON RAF-DB DATASET
Overall, our proposed method achieves 90.38% recognition accuracy on RAF-DB. As shown in Table 2, our method achieves the best result among all methods. In detail, our FST-MWOS obtains a gain of 0.91% over FDRL, the previous state-of-the-art method. This result demonstrates the effectiveness and superiority of our method in performance enhancement.

2) COMPARISON RESULTS ON FERPlus DATASET
As we can see in Table 2, our proposed FST-MWOS achieves 90.41% recognition accuracy on FERPlus. Under the same experimental settings, the improvements of our FST-MWOS on FERPlus are 0.37% and 0.90% compared with FER-VT and DMUE, respectively. The above results on the RAF-DB and FERPlus datasets show that our method has superior performance on small datasets.

3) COMPARISON RESULTS ON AffectNet DATASET
It should be noted that AffectNet has an imbalanced training set and a balanced validation set; the distribution of training samples is shown in Fig 6(a). To deal with the imbalance issue, we adopt the oversampling strategy (WeightedRandomSampler) provided by the PyTorch toolbox. The results after oversampling are shown in Fig 6(b), where the distribution of training samples is greatly improved compared to that before treatment. As shown in Table 2, we achieve 63.33% recognition accuracy on the oversampled AffectNet, which outperforms VTFF and DMUE by 1.48% and 0.22%, respectively. The improvements of our method over previous methods suggest that FST-MWOS indeed has better generalization ability even on large-scale expression recognition datasets like AffectNet.
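The per-sample weights handed to WeightedRandomSampler can be sketched as inverse class frequencies; the class names and counts below are illustrative, not AffectNet's actual distribution:

```python
from collections import Counter

def sample_weights(labels):
    """Weight each sample by the inverse frequency of its class."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

labels = ["happy"] * 6 + ["fear"] * 2   # an imbalanced toy label list
weights = sample_weights(labels)
```

With these weights, every class contributes the same expected total weight, so minority expressions are drawn as often as majority ones during training.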
Overall, the above comparison results prove the effectiveness and superiority of our proposed method. The FST-MWOS method further improves the expression recognition performance by focusing on mining the correlation between multiple groups of fine-tuned model weights.

F. ABLATION STUDY
To show the effectiveness of our method, we perform ablation studies to evaluate the influence of key parameters and components on the final performance. For all the experiments, we use RAF-DB, FERPlus and AffectNet datasets to evaluate the performance.

1) INFLUENCE OF THE KEY COMPONENTS
Our proposed method FST-MWOS consists of fine-tuning Swin Transformer (FST) and multiple weights optimality-seeking (MWOS). To validate the effectiveness of these components, we perform ablation studies for FST and MWOS on the RAF-DB, FERPlus and AffectNet databases, respectively. The experimental setup and results are reported in Table 3, where MWOS contains two parts: the local greedy strategy (LGS) and the global greedy strategy (GGS). Since GGS performs optimality-seeking over n groups of fine-tuned model weights, it cannot be effective without the FST component.
Settings (I, III) demonstrate that integrating the FST component improves the baselines (the best individual model) on RAF-DB, FERPlus and AffectNet by 1.01%, 0.09% and 0.85%, which suggests the FST component is beneficial in improving expression recognition performance. This can be explained by the fact that, by fine-tuning the parameter configurations of the Swin Transformer network, the FST component provides multiple groups of model weights to support the subsequent optimality-seeking. According to settings (II, III, IV), the MWOS leads to an increase in recognition accuracy when fusing the LGS and the GGS, showing the effectiveness of the proposed MWOS. Specifically, we can see from settings (III, IV) that the designed LGS further improves the performance by 0.35%, 0.25% and 0.23%. MWOS aggregates local and global greedy model weights to mine the optimal generalization, which further improves recognition performance.

2) INFLUENCE OF THE k
The LGS finds the optimal solution within the best epoch interval of an individual model. Therefore, we set up different epoch intervals to study their effect on recognition accuracy, as shown in Table 4. From settings (I, II, III), we can clearly see that as the range of epoch intervals is expanded, we are able to find extremes with better performance. However, settings (III, IV) reflect that the recognition performance reaches saturation at k = 4.

3) INFLUENCE OF THE n
Table 5 reflects the performance of the model with different n, where setting I indicates the individual model that performs best on the validation set and is used as the reference group. From the table, it can be seen that our method achieves the best recognition accuracy when the number of groups n is set to 15. At the same time, we observe that the optimal extremum cannot be found effectively when n is small, while increasing n excessively does not have a positive effect and increases the cost of training.

4) INFLUENCE OF TYPES OF SWIN TRANSFORMER
To evaluate the impact of different types of Swin Transformer on recognition performance, we selected four types (Tiny, Small, Base and Large) for our experiments; the results are shown in Table 6. We found that recognition performance improves as the model size gradually increases.

V. CONCLUSION
In this paper, we proposed a novel method to further enhance facial expression recognition performance. FST-MWOS consists of two main components: FST and MWOS. FST makes the Swin-L model suitable for the FER task by fine-tuning it to extract facial features more accurately and to strengthen the correlation between features. MWOS mines the optimal generalization with positive performance improvement via a local-global greedy strategy, effectively improving expression recognition accuracy. Experimental results on three publicly available facial expression datasets have shown the superiority of our method in performing FER.
Our future work includes continued research on methods for improving expression recognition performance and their application to emotion chatbots, which will ultimately be used to accompany people living alone.
HONGQI FENG received the B.S. degree in process equipment and control engineering from the Jiangsu University of Chemical Technology. He is currently a Professor and a master's degree Supervisor with Changzhou University, Jiangsu, China. He has presided over and participated in the completion of a number of scientific research projects, including projects funded by the National Natural Science Foundation of China and provincial and ministerial funds. His research interests include deep learning, natural language processing, data mining, and big data analytics.
WEIKAI HUANG was born in Wenzhou, Zhejiang, China. He is currently pursuing the M.E. degree with Changzhou University. During his M.E. study, he was mainly responsible for the completion of the Provincial Research Fund Project related to healthy aging for the elderly. His research interests include facial expression recognition and deep learning.
DENGHUI ZHANG is currently a Professor and the master's degree Supervisor with Zhejiang Shuren University, Zhejiang, China. Up to now, he published more than 30 academic articles and presided over and participated in the completion of a number of scientific research projects, including the National Natural Science Foundation of China, and provincial and ministerial fund projects. His research interest includes intelligent human-computer interaction.
BANGZE ZHANG received the B.S. degree in information and computing sciences from Wenzhou University, Wenzhou, China, in 2020. He is currently pursuing the M.S.E. degree in computer technology with the School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China. His research interest includes medical image processing. VOLUME 11, 2023