Scene Classification of Remote Sensing Images Using EfficientNetV2 with Coordinate Attention

The high intra-class diversity of remote sensing image scenes often makes them difficult to classify. This paper therefore proposes the CA-EfficientNetV2 model, which embeds coordinate attention into the head of the EfficientNetV2 network to improve classification performance. Coordinate attention encodes the positional relationship between image space and channels so that features can be learned efficiently. We trained three improved models, CA-EfficientNetV2-S, CA-EfficientNetV2-M, and CA-EfficientNetV2-L, on the UC Merced remote sensing dataset, reaching classification accuracies of 99.55%, 97.49%, and 97.09%, respectively. CA-EfficientNetV2-S performed best, improving on the original network by 0.8%.


Introduction
Remote sensing image scene classification has important applications in natural disaster monitoring [1] and land cover utilization and restoration [2]. Early remote sensing classification research relied mainly on extracting low-level image features such as color histograms, texture features, and spectral information; however, these low-level features did not capture the overall semantic information of an image, and designing them required substantial engineering effort and domain expertise. Later, the bag-of-visual-words (BoVW) model and the Latent Dirichlet Allocation (LDA) model were used to extract mid-level image features; this type of method reduced the workload of hand-crafting features and improved the accuracy of classification models.
As deep learning [3] developed, Penatti et al. [4] were the first to apply Convolutional Neural Networks (CNNs) to remote sensing classification, with results surpassing traditional classification methods, since CNNs can extract complex high-level image features. An ordinary CNN, however, may fail to focus on salient features, and CNN improvements mainly concern three aspects: width, depth, and resolution. GoogLeNet [5] increased network width by using convolution kernels of different scales to extract features; ResNet [6] used a residual structure to suppress overfitting, allowing networks to grow deeper; EfficientNet [7] introduced a model that balances width, depth, and resolution for better performance. EfficientNetV2 [8] improved on its predecessor, reducing the number of model parameters by up to 6.8 times and increasing training speed by up to 11 times. This article uses EfficientNetV2 as the backbone network for the scene classification task of remote sensing images. The main contributions of this article are: (1) We applied the Coordinate Attention (CA) mechanism [16] to the head of the EfficientNetV2 [8] network. CA encodes the long-range dependence between the spatial location information of the image and the channels, making it easier to extract high-level features.
(2) We trained the proposed CA-EfficientNetV2 model on the UC Merced dataset, analyzed the results, and compared them with other methods; the results indicate that the CA-EfficientNetV2 model achieves better classification performance and the expected effect.
The remainder of this article is organized as follows. Section 2 reviews research on scene classification based on attention mechanisms. Section 3 describes the CA-EfficientNetV2 structure and the CA attention mechanism. Section 4 presents experiments with CA-EfficientNetV2 and compares and analyzes the experimental results. The last section summarizes our main work and outlines future research.

Related work
Features are difficult to extract from remote sensing images, so to improve classification accuracy researchers began introducing attention mechanisms [9] to boost network performance, letting convolutional neural networks automatically learn important features from different channels. Raza et al. [10] proposed an end-to-end capsule network that combines an attention mechanism and uses different capsule structures to encode different complex scene images. Guo et al. [11] replaced the fully connected layer of VGG with three branches adopting attention in two domains, channel and spatial, and verified the model's performance on complex datasets. Xu et al. [12] combined two different attention mechanisms, control-gate attention and feedback attention, at the main and non-main positions of the network, and demonstrated the model's validity experimentally. Wan et al. [13] designed a multi-scale fusion discrimination framework that adds a lightweight attention mechanism to quickly learn important channel features and dilute edge information, with good generalization ability. Wu et al. [14] proposed a self-attention network model with a joint loss, integrating an attention model into ResNet-18 to extract image features and reduce the interference of redundant information in complex images. Alhichri et al. [15] proposed an attention method that takes a weighted average of the original feature matrix as the feature matrix and applies the attention module at the 262nd layer of EfficientNet-B3, with clear classification gains. Although these existing methods increased the classification accuracy of convolutional neural networks by adding attention mechanisms, they ignored the importance of the relationship between image space and image channels.

Methodology
This article proposes a model called CA-EfficientNetV2 for remote sensing image scene classification. This section introduces the CA-EfficientNetV2 structure and the CA attention mechanism.

CA-EfficientNetV2 structure
The CA-EfficientNetV2 model is mainly divided into the backbone network module (the green dashed part in Figure 1) and the CA attention module (the orange dashed part in Figure 1). The backbone network is the EfficientNetV2 network, composed mainly of MBConv modules and Fused-MBConv modules. The numbers 2, 8, 6, and 24 under the modules in the figure represent the number of times each module is stacked, and the numbers 4, 4, 4, and 6 after the module names represent the expansion coefficients of the different modules, i.e., the dimension of the feature matrix is increased before entering the feature extraction module, which can improve the efficiency of feature extraction. We take the small (S) model as an example, as sketched below.
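The following is a minimal sketch, under our own assumptions, of how the assembly in Figure 1 could be expressed in TensorFlow/Keras: an EfficientNetV2-S backbone with the CA module placed at the head, before global pooling and the 21-way classifier. The function name build_ca_efficientnetv2_s is ours, the use of ImageNet weights is an assumption, and coordinate_attention refers to the sketch given in the next subsection; this illustrates the architecture rather than reproducing the authors' code.

```python
# Illustrative sketch (our assumptions, not the authors' released code):
# EfficientNetV2-S backbone + coordinate attention at the head.
import tensorflow as tf

def build_ca_efficientnetv2_s(num_classes=21, input_shape=(224, 224, 3)):
    backbone = tf.keras.applications.EfficientNetV2S(
        include_top=False, weights="imagenet",  # pretrained weights assumed
        input_shape=input_shape)
    x = backbone.output                         # final backbone feature map
    x = coordinate_attention(x, reduction=32)   # CA module embedded at the head
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(backbone.input, outputs, name="CA-EfficientNetV2-S")
```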

Principle of CA Attention Module
The CA mechanism encodes the long-range dependence between spatial location information and channels, making them interrelated, so that high-level feature information is easier to extract. The orange part in Figure 1 is the CA module; it consists of two steps: information embedding and attention generation.

Information embedding encodes the information of channel c by pooling the input feature matrix x in two directions (horizontal and vertical), as given in (1) and (2):

z_c^p(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)    (1)

z_c^q(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)    (2)

where z_c^p(h) is the output at image height h on channel c, and likewise z_c^q(w) at image width w. Equations (1) and (2) perform global average pooling along the horizontal and vertical directions of the image respectively, producing two one-dimensional feature encodings embedded along the two spatial directions.

The second step is coordinate attention generation. The outputs z_c^p(h) and z_c^q(w) of (1) and (2) are concatenated, the convolutional layer F_1 is used for feature extraction, and the intermediate feature map u covering both directions is obtained through the non-linear activation \delta, as shown in (3):

u = \delta(F_1([z^p, z^q]))    (3)

u is then split into u^p and u^q, the mappings along the horizontal and vertical directions respectively, and the attention outputs are generated as in (4) and (5):

y^p = \sigma(F_p(u^p))    (4)

y^q = \sigma(F_q(u^q))    (5)

where F_p and F_q are the convolution operations for the horizontal and vertical directions respectively, and coordinate attention is generated through the \sigma (sigmoid) activation function.
The overall process of the attention module can then be formulated as (6):

v_c(i, j) = x_c(i, j) \times y_c^p(i) \times y_c^q(j)    (6)

where v_c(i, j) is the final feature matrix generated by attention, x_c(i, j) is the original input feature matrix, y_c^p(i) is the final output in the horizontal direction, and y_c^q(j) is the final output in the vertical direction.
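To make the two-step computation concrete, the following is a minimal TensorFlow sketch of a CA module following Eqs. (1)-(6). It assumes a TF 2.x functional-style model with channels-last tensors and static shapes; the reduction ratio, the choice of swish for \delta, and the batch normalization follow the original CA paper [16] rather than this article, so they should be read as assumptions.

```python
import tensorflow as tf

def coordinate_attention(x, reduction=32):
    """Coordinate attention sketch following Eqs. (1)-(6); x is (B, H, W, C)."""
    _, h, w, c = x.shape                        # assumes statically known shape
    # Eqs. (1)-(2): global average pooling along each spatial direction.
    z_p = tf.reduce_mean(x, axis=2, keepdims=True)   # (B, H, 1, C)
    z_q = tf.reduce_mean(x, axis=1, keepdims=True)   # (B, 1, W, C)
    z_q = tf.transpose(z_q, [0, 2, 1, 3])            # (B, W, 1, C)
    # Eq. (3): concatenate, shared 1x1 conv F_1, then BN and delta (swish assumed).
    u = tf.concat([z_p, z_q], axis=1)                # (B, H + W, 1, C)
    mid = max(8, c // reduction)
    u = tf.keras.layers.Conv2D(mid, 1, use_bias=False)(u)
    u = tf.keras.layers.BatchNormalization()(u)
    u = tf.nn.swish(u)
    # Split back into the two directional maps u_p and u_q.
    u_p, u_q = tf.split(u, [h, w], axis=1)
    u_q = tf.transpose(u_q, [0, 2, 1, 3])            # (B, 1, W, mid)
    # Eqs. (4)-(5): per-direction 1x1 convs F_p, F_q with sigmoid.
    y_p = tf.sigmoid(tf.keras.layers.Conv2D(c, 1)(u_p))   # (B, H, 1, C)
    y_q = tf.sigmoid(tf.keras.layers.Conv2D(c, 1)(u_q))   # (B, 1, W, C)
    # Eq. (6): reweight the input by both attention maps via broadcasting.
    return x * y_p * y_q
```

The multiplication in the last line broadcasts the (B, H, 1, C) and (B, 1, W, C) attention maps over the full (B, H, W, C) input, realizing Eq. (6) position by position.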

Experiment
To evaluate the accuracy of the proposed CA-EfficientNetV2 model, we used the UC Merced Land Use dataset [17], a remote sensing dataset of 21 scene categories with 100 images per category; each image is 256 × 256 pixels at a resolution of 0.3 m. Sample images are shown in Figure 2. The image size was set to 224 × 224 pixels in the experiments. The experimental environment was Ubuntu 16.04 with an Nvidia RTX 3060 graphics card with 12 GB of video memory. The code was written under the TensorFlow framework; the optimizer was Adam, the learning rate 0.0001, the batch size 8, and the number of training epochs 100. The training ratios were set to 50% and 80%, respectively.

Table 1 shows the experimental comparison results on the UC Merced Land Use dataset. At the 50% training ratio, the classification accuracies of the proposed CA-EfficientNetV2-S, CA-EfficientNetV2-M, and CA-EfficientNetV2-L models reached 97.20%, 94.49%, and 90.88%, improvements of 0.74%, 2.22%, and 0.42% over the original EfficientNetV2-S, EfficientNetV2-M, and EfficientNetV2-L models, respectively. CA-EfficientNetV2-S had the best classification effect, improving on the classic D-CapsNet [10] and Self-Attention With Joint Loss [14] models by 0.32% and 1.32%, though it remained below the EfficientNet-B3-Attn-2 [15] model. At the 80% training ratio, the classification accuracies of CA-EfficientNetV2-S, CA-EfficientNetV2-M, and CA-EfficientNetV2-L reached 99.55%, 97.49%, and 97.07%, which is 0.8%, 1.46%, and 0.91% higher than the original EfficientNetV2-S, EfficientNetV2-M, and EfficientNetV2-L models, respectively. CA-EfficientNetV2-S again had the best classification accuracy, outperforming the classic D-CapsNet [10], Self-Attention With Joint Loss [14], and EfficientNet-B3-Attn-2 [15] models.

The experimental results show that, at both training ratios, the CA-EfficientNetV2-S, CA-EfficientNetV2-M, and CA-EfficientNetV2-L models improved classification accuracy, indicating that the CA module embedded in the head of the EfficientNetV2 network can improve the accuracy of remote sensing image classification tasks. The CA attention module uses two global 1D pooling operations to decompose the spatial positions of the image into two feature encodings, horizontal and vertical, aggregates them into two separate position-aware feature maps, and then encodes the feature maps with the embedded direction-specific information; each position of the resulting feature matrix captures long-range dependencies of the input feature map along one spatial direction, and two attentions encoding the positional relationship are finally generated. Since CA is embedded at the head position of EfficientNetV2, every image entering the model produces an attention feature map with embedded position coordinates, which improves the classification accuracy of the model.
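As a reproducibility aid, the following is a minimal sketch of the training setup under the stated hyperparameters (Adam, learning rate 0.0001, batch size 8, 100 epochs, 224 × 224 inputs, 80% training ratio). The dataset path, the split mechanics, and the use of image_dataset_from_directory are our assumptions, not the authors' pipeline; build_ca_efficientnetv2_s is the sketch from Section 3.

```python
# Minimal training sketch with the stated hyperparameters; path and split
# mechanics are assumptions. Note: the raw UC Merced images are TIFFs, so we
# assume they have been converted to a format Keras can read (e.g., PNG).
import tensorflow as tf

def make_split(subset):
    return tf.keras.utils.image_dataset_from_directory(
        "UCMerced_LandUse/Images",   # assumed layout: one folder per class
        validation_split=0.2,        # 80% training ratio
        subset=subset, seed=42,
        image_size=(224, 224), batch_size=8)

train_ds, val_ds = make_split("training"), make_split("validation")

model = build_ca_efficientnetv2_s(num_classes=21)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=100)
```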
We further evaluated the generalization ability of the model through a confusion matrix. Figure 3 shows the predictive ability of CA-EfficientNetV2-S on the test data at the 80% training ratio. The diagonal numbers in the figure are the numbers of correctly predicted samples, and the off-diagonal numbers are the numbers of incorrectly predicted samples. It can be seen that our model predicted all 21 classes well, with an overall prediction accuracy of 99.81%. There was a small amount of confusion in the "buildings" and "tenniscourt" classes, which had fewer training samples and were therefore harder to classify.
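A short sketch of how such a confusion matrix can be computed from the trained model is given below; iterating batch by batch keeps labels and predictions aligned regardless of dataset shuffling. The variable names follow the training sketch above and are illustrative.

```python
# Sketch of the confusion-matrix evaluation behind a figure like Figure 3;
# `model` and `val_ds` come from the training sketch above.
import numpy as np
import tensorflow as tf

y_true, y_pred = [], []
for images, labels in val_ds:                 # batch-by-batch keeps alignment
    probs = model.predict_on_batch(images)
    y_true.append(labels.numpy())
    y_pred.append(np.argmax(probs, axis=1))
y_true, y_pred = np.concatenate(y_true), np.concatenate(y_pred)

cm = tf.math.confusion_matrix(y_true, y_pred, num_classes=21).numpy()
overall_acc = np.trace(cm) / cm.sum()         # diagonal entries are correct
```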

Conclusion
This paper proposed the CA-EfficientNetV2 model, based on CA attention and the EfficientNetV2 network, embedding the CA attention module at the head position of the EfficientNetV2 network to obtain CA-EfficientNetV2-S, CA-EfficientNetV2-M, and CA-EfficientNetV2-L. The improved models were trained on the UC Merced Land Use dataset and showed clear improvements; CA-EfficientNetV2-S performed best, with an accuracy of 99.55%.