We propose the fine-tuned channel–spatial attention transformer model (FT-CSAT), an efficient and effective transformer backbone consisting of two key modules: the channel–spatial attention module and the fine-tuning module. In this section, we first introduce the overall architecture of our model. Then, we describe these two modules in detail.
3.2. Channel–Spatial Attention Module
In recent years, the attention mechanism, as one of the important components of neural networks, has been widely used in the field of expression recognition. Hu et al. proposed the SE (Squeeze-and-Excitation) module [22], which learns the correlations between the channels of a feature map, generates channel attention, and enables the network to focus on informative channels; it brings significant performance improvements to CNNs. The Convolutional Block Attention Module (CBAM), built on the attention mechanism of SE, was first proposed by Woo et al. [23]. Compared with SENet, which only attends to channel features, CBAM combines channel attention and spatial attention, adaptively refining features along these two independent dimensions.
Suppose the input feature image is F and its dimension is C × H × W. The module obtains the channel attention map M_c and the spatial attention map M_s via the channel and spatial attention modules successively. The specific process can be summarized as follows:
$$F' = M_c(F) \otimes F,$$
$$F'' = M_s(F') \otimes F',$$
where ⊗ represents element-wise multiplication, F′ is the corrected output of the channel attention module, and F″ is the final output corrected by the spatial attention module.
The channel attention module is used to focus on whether target features are present in the input image. First, average pooling and maximum pooling are performed, respectively, to obtain the descriptors F^c_avg and F^c_max. Then, both descriptors are passed through a shared-weight MLP. Finally, the channel attention map M_c is obtained using Sigmoid. The specific calculation formula is as follows:
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big),$$
where σ represents the Sigmoid activation function, and W_0 and W_1 represent the weights of the shared MLP(·).
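As a concrete illustration, the following PyTorch-style sketch implements the channel attention branch described above; the reduction ratio and layer layout follow the original CBAM design and are assumptions rather than settings reported in this paper.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention branch of CBAM, implementing
    M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared two-layer MLP (weights W0 and W1 in the formula above).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W) -> pooled channel descriptors of shape (B, C).
        f_avg = f.mean(dim=(2, 3))
        f_max = f.amax(dim=(2, 3))
        m_c = self.sigmoid(self.mlp(f_avg) + self.mlp(f_max))
        # Reshape to (B, C, 1, 1) so it can rescale the input feature map.
        return m_c.unsqueeze(-1).unsqueeze(-1)
```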
The spatial attention module mainly focuses on the feature information of the target's location. First, average pooling and maximum pooling are performed along the channel axis on the feature F′ generated by the channel attention module. Then, the two resulting two-dimensional maps are concatenated and a convolution operation is applied. Finally, the spatial attention map M_s is generated using Sigmoid. The specific calculation formula is as follows:
$$M_s(F') = \sigma\big(f^{7\times7}([\mathrm{AvgPool}(F');\,\mathrm{MaxPool}(F')])\big) = \sigma\big(f^{7\times7}([F^{s}_{avg};\,F^{s}_{max}])\big),$$
where σ represents the Sigmoid activation function, and f^{7×7} denotes the convolution operation with a kernel size of 7 × 7.
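The spatial attention branch and the sequential CBAM refinement (F′ = M_c(F) ⊗ F followed by F″ = M_s(F′) ⊗ F′) can be sketched in the same way, reusing the ChannelAttention module defined above:

```python
class SpatialAttention(nn.Module):
    """Spatial attention branch of CBAM, implementing
    M_s(F') = sigmoid(f7x7([AvgPool(F'); MaxPool(F')]))."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Pool along the channel axis and concatenate into a 2-channel map.
        f_avg = f.mean(dim=1, keepdim=True)    # (B, 1, H, W)
        f_max, _ = f.max(dim=1, keepdim=True)  # (B, 1, H, W)
        return self.sigmoid(self.conv(torch.cat([f_avg, f_max], dim=1)))


class CBAM(nn.Module):
    """Sequential refinement: F' = M_c(F) ⊗ F, then F'' = M_s(F') ⊗ F'."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_attention = ChannelAttention(channels)
        self.spatial_attention = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f = self.channel_attention(f) * f
        f = self.spatial_attention(f) * f
        return f
```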
This paper proposes adding the CBAM attention module to the CSWin Transformer network so that the network can extract effective features along the channel and spatial dimensions while suppressing irrelevant ones. This guides the model to identify the key regions related to expression, thereby improving its feature learning ability.
Discussion. The CSWin Transformer network, used as the baseline in this paper, consists of four stages. Each stage consists of sequential CSWin Transformer blocks and maintains the number of tokens. A convolution layer (3 × 3, stride 2) is used between two adjacent stages to reduce the number of tokens and double the channel dimension. CBAM is an end-to-end universal module that can be seamlessly integrated at any position of a convolutional neural network for end-to-end training. Theoretically, CBAM can therefore be integrated into any stage of the CSWin Transformer; the main difference lies in the spatial size and channel dimension of the feature maps at each stage. In practice, however, integrating CBAM into different stages of the CSWin Transformer has different effects on expression recognition accuracy.
We evaluate four integration approaches, as illustrated in
Figure 2. Method (a) integrates CBAM into the first stage of the CSWin Transformer, method (b) into the second stage, method (c) into the third stage, and method (d) into the fourth stage. The recognition accuracy of these different methods on the RAF-DB and FERPlus expression datasets is presented in
Table 1.
From
Table 1, it can be observed that, except for method (a), the expression recognition accuracy of all other methods has improved. On the RAF-DB dataset, method (d) achieved the highest recognition accuracy, reaching 87.58%, which is 0.29% higher than the baseline. On the FERPlus dataset, method (c) achieved the highest recognition accuracy, reaching 88.05%, which is 0.38% higher than the baseline. Therefore, we propose method (e), which integrates the CBAM module into the third and fourth stages of the CSWin Transformer simultaneously. The experimental results demonstrate that method (e) achieves higher expression recognition accuracy. The accuracy on the RAF-DB and FERPlus datasets increased from 87.29% and 87.67% of the baseline to 88.01% and 88.51%, respectively.
Experiments show that, after the CBAM module is integrated into the CSWin Transformer network, the maximum pooling and average pooling in its channel and spatial branches effectively learn discriminative global and local features from facial expression images and accurately compute the weight of each spatial position in the feature map, thereby strengthening the role of important spatial features in FER tasks.
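For illustration only, the sketch below shows one possible wiring for method (e), applying CBAM to the token outputs of stages 3 and 4. The stage interface, the channel dimensions, and the token-to-feature-map reshaping are assumptions; the actual insertion points are those shown in Figure 2.

```python
class CSWinWithCBAM(nn.Module):
    """Hypothetical wiring for method (e): CBAM applied after stages 3 and 4.

    `backbone` is assumed to expose four stages (`stage1`..`stage4`) producing
    token sequences of shape (B, H*W, C) on a square grid; the real CSWin
    Transformer implementation may expose a different interface.
    """

    def __init__(self, backbone: nn.Module, dims=(64, 128, 256, 512)):
        super().__init__()
        self.backbone = backbone
        self.cbam3 = CBAM(dims[2])
        self.cbam4 = CBAM(dims[3])

    @staticmethod
    def _refine(tokens: torch.Tensor, cbam: CBAM) -> torch.Tensor:
        # Reshape tokens (B, L, C) into a feature map (B, C, H, W) for CBAM.
        b, l, c = tokens.shape
        h = w = int(l ** 0.5)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = cbam(x)
        return x.flatten(2).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.backbone.stage1(x)
        x = self.backbone.stage2(x)
        x = self._refine(self.backbone.stage3(x), self.cbam3)
        x = self._refine(self.backbone.stage4(x), self.cbam4)
        return x
```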
3.3. Fine-Tuning Module
The existing fine-tuning methods are mainly divided into two types. One is full fine-tuning, which tunes all parameters of the pre-trained model and therefore introduces a large number of trainable parameters. The other is to tune only the last linear layer, which avoids introducing too many parameters, but its accuracy is significantly lower than that of full fine-tuning. This paper uses the Scaling and Shifting Features (SSF) parameter fine-tuning method [24]. Different from the above two methods, SSF can not only improve the performance of the model through parameter fine-tuning but also control the number of parameters introduced.
The SSF parameter fine-tuning method achieves parameter fine-tuning simply by scaling and shifting the deep features extracted by the pre-trained transformer model, without introducing additional inference parameters. It draws on the concepts of variance and mean: through the scale and shift parameters, the features extracted by a model pre-trained on an upstream dataset can be better adapted to the downstream task. During training on the downstream dataset, the pre-trained weights are frozen, and only the parameters of the SSF modules are updated. The feature output by the preceding operation is multiplied element-wise (dot product) by a scale factor and then summed with a shift factor. The specific calculation formula is as follows:
$$y = \gamma \odot x + \beta,$$
where x represents the input, y is the output (and also the input of the next operation), γ and β are the scale and shift factors, respectively, and ⊙ is the dot product.
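A minimal sketch of this operation, assuming the modulated features keep their channel dimension last (e.g., token sequences of shape (B, L, C)):

```python
class SSF(nn.Module):
    """SSF modulation y = γ ⊙ x + β; γ and β are the only trainable
    parameters of the block when the backbone weights are frozen."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))   # γ, initialised to 1
        self.shift = nn.Parameter(torch.zeros(dim))  # β, initialised to 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcast over the last (channel) dimension of x.
        return x * self.scale + self.shift
```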
In this paper, SSF is inserted after some specified operations in the pre-training model to modulate features, as shown in
Figure 3. These specified operations include MLP, Cross-Shaped Window Self-Attention, and LN.
MLP with SSF. In the CSWin Transformer block, the MLP consists of two fully connected layers, allowing the model to capture more complex relationships between image features and serving as the input for the next attention block. The output features of each fully connected layer are multiplied element-wise (dot product) by a scale factor and then summed with a shift factor. The MLP after inserting SSF is shown in
Figure 3a. The specific calculation formula is as follows:
$$y = \gamma \odot (W x + b) + \beta,$$
where x is the input of the preceding fully connected layer in the MLP, W is the weight, and b is the bias; γ and β are the scale and shift factors, respectively.
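Following this formula, a sketch of the MLP block with SSF inserted after each fully connected layer might look as follows; the GELU activation and hidden width are assumptions based on the standard transformer MLP rather than details given here, and the SSF module sketched above is reused.

```python
class MlpWithSSF(nn.Module):
    """Two-layer MLP of a CSWin Transformer block with an SSF module
    inserted after each fully connected layer (Figure 3a)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.ssf1 = SSF(hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.ssf2 = SSF(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = γ ⊙ (W x + b) + β after each fully connected layer.
        x = self.act(self.ssf1(self.fc1(x)))
        return self.ssf2(self.fc2(x))
```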
Cross-Shaped Window Self-Attention with SSF. The CSWin Transformer introduces the Cross-Shaped Window Self-Attention mechanism, which builds on the multi-head self-attention mechanism. The input is linearly transformed into a query Q, a key K, and a value V; in this paper, the output of this linear transformation is fine-tuned using the SSF method. The input features are then linearly projected to K heads, which are split equally into two parallel groups (each with K/2 heads). Heads 1, …, K/2 perform horizontal-stripe self-attention, while heads K/2 + 1, …, K perform vertical-stripe self-attention. The outputs of these two parallel groups are concatenated back together by a fully connected layer. In this paper, SSF modules are also inserted after the fully connected layers. The Cross-Shaped Window Self-Attention with SSF is shown in
Figure 3b.
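The SSF insertion points inside the attention block can be sketched as follows; the cross-shaped stripe attention itself is left abstract behind a placeholder argument, since only the positions of the SSF modules (after the Q/K/V projection and after the final fully connected layer) are specified here.

```python
class CSWinAttentionWithSSF(nn.Module):
    """Placement of SSF inside the attention block (Figure 3b): after the
    linear Q/K/V projection and after the final fully connected layer.
    The cross-shaped (horizontal/vertical stripe) attention itself is not
    reproduced here."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.ssf_qkv = SSF(dim * 3)   # fine-tunes the Q/K/V projection output
        self.proj = nn.Linear(dim, dim)
        self.ssf_proj = SSF(dim)      # fine-tunes the output projection

    def forward(self, x: torch.Tensor, stripe_attention) -> torch.Tensor:
        q, k, v = self.ssf_qkv(self.qkv(x)).chunk(3, dim=-1)
        # Half of the heads attend within horizontal stripes and half within
        # vertical stripes; `stripe_attention` stands in for that computation.
        out = stripe_attention(q, k, v, self.num_heads)
        return self.ssf_proj(self.proj(out))
```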
LN with SSF. LN is used to normalize the output features of the CSWin-Attention, which helps to stabilize the training process and improve the performance of the model. Each CSWin Transformer block contains two LNs. As shown in
Figure 3c, we insert an SSF module after each LN operation for parameter fine-tuning. During the fine-tuning process, the pre-trained LN weight parameters are frozen; the scale and shift factors are updated and then merged into the original parameter space.
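A sketch of an LN block wrapped with SSF, including the merge step that folds the learned factors back into the frozen LN parameters (possible because γ ⊙ (w ⊙ x̂ + b) + β = (γ ⊙ w) ⊙ x̂ + (γ ⊙ b + β)), so no extra parameters remain at inference time:

```python
class LayerNormWithSSF(nn.Module):
    """Frozen LayerNorm followed by SSF (Figure 3c)."""

    def __init__(self, dim: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.ln.weight.requires_grad_(False)  # pre-trained LN parameters stay frozen
        self.ln.bias.requires_grad_(False)
        self.ssf = SSF(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ssf(self.ln(x))

    @torch.no_grad()
    def merge(self) -> None:
        # Fold the learned scale and shift back into the original LN parameters.
        self.ln.weight.mul_(self.ssf.scale)
        self.ln.bias.mul_(self.ssf.scale).add_(self.ssf.shift)
        self.ssf.scale.fill_(1.0)
        self.ssf.shift.fill_(0.0)
```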