Attention-based CNNs for Image Classification: A Survey

Deep learning techniques such as CNNs can learn powerful contextual information, so they have been widely applied in image recognition. However, deep CNNs may rely on large width and large depth, which increases computational costs. Fusing an attention mechanism into a CNN can address this problem. In this paper, we survey attention mechanisms acting on CNNs for image classification. Firstly, the survey reviews the development of CNNs for image classification. Then, we illustrate the basics of CNNs and attention mechanisms for image classification. Next, we present the main architectures of CNNs with attention, public and collected datasets, and experimental results in image classification. Finally, we point out potential research directions and challenges of attention-based CNNs for image classification and summarize the whole paper.


Introduction
Deep networks can mine more accurate information to represent images; for example, AlexNet [1] achieved huge success on ImageNet in 2012. Subsequently, due to their flexible architectures, convolutional neural networks (CNNs) were extended to image recognition [2]. Besides, with the improvement of computing power from graphics processing units (GPUs), varying network architectures have been presented for image recognition, which can be divided into two kinds: increasing the depth and increasing the width of CNNs [3]. In terms of increasing the depth of CNNs, VGG [4] stacked small convolutional kernels to enlarge the receptive field and extract more accurate information for image classification. Alternatively, GoogLeNet [5] utilized convolutional kernels of different types to increase the width and improve classification performance. Although the mentioned methods are competitive in image recognition, they still suffer from drawbacks [6]. Firstly, deeper CNNs easily face gradient explosion or gradient vanishing [6]. Secondly, wider CNNs may cause overfitting [6]. To prevent these phenomena, ResNet was proposed [7]. That is, it fused the output of the current layer with those of previous layers as the input of the next layer to enhance memory ability, which improved classification performance at no extra cost [7]. To reduce computational costs, attention-based CNNs are used in image classification [8]. There is little literature summarizing attention-based CNNs for image classification. In this paper, we conduct a survey to classify this literature, which can help readers easily understand its principles. Firstly, we give the development of CNNs for image classification. Subsequently, we introduce the basics of popular CNNs and attention techniques in image classification. Then, we show the main architectures of CNNs with attention, experimental settings, and experimental results in image classification.
Finally, we point out potential research directions and challenges of attention-based CNNs for image classification and sum up the whole paper.

Related work
It is known that CNNs with attention have obtained excellent performance in image classification. Understanding CNN and attention techniques is therefore essential for improving these methods and their classification results. Thus, we introduce popular CNNs and the main attention techniques for image classification in this section.

Popular CNNs for image classification
According to the previous illustrations, it is known that residual networks are very effective in image classification [7]. Also, wider CNNs can extract complementary information to boost classification results [9]. Inspired by that, a residual architecture was used to expand the width for obtaining robust information to recognize images [9]. For instance, fusing ResNet with branches of 3×3 and 1×1 convolutions mines representative information for image classification [9]. ResNeXt used a homogeneous, multi-branch architecture to represent the classified image [10]. Besides, using a residual network as a component gathered into a CNN can improve the generalization ability of a classifier [11]. Utilizing multi-scale and residual blocks to fuse different semantic information was a good tool for image classification [12]. Using residual learning techniques to fuse hierarchical features can enhance the memory ability of a deep CNN for image classification [13]. To deal with insufficient samples, generative adversarial networks (GANs) employed a generative network to generate similar samples according to the given training samples [14]. Then, a GAN used a discriminative network to judge the authenticity of all training samples to train a robust classifier [14]. Besides, graph convolutional networks are very effective in multi-label image recognition [15].
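The residual learning idea described above, where the block output is fused with its input, can be sketched as follows. This is a minimal NumPy illustration of the data flow only: `conv_layer` is a hypothetical stand-in for a real convolution, not the actual ResNet implementation.

```python
import numpy as np

def conv_layer(x, weight):
    # Stand-in for a convolutional layer: a simple linear map
    # followed by ReLU, used only to illustrate the data flow.
    return np.maximum(weight @ x, 0.0)

def residual_block(x, w1, w2):
    # Residual learning: the block output is F(x) + x, so the
    # identity input is fused with the transformed features,
    # easing gradient flow in deep networks.
    out = conv_layer(x, w1)
    out = w2 @ out                    # second transform, no activation yet
    return np.maximum(out + x, 0.0)   # add the identity, then activate

x = np.ones(4)
w = np.eye(4) * 0.5
y = residual_block(x, w, w)           # each entry: relu(0.25 + 1.0) = 1.25
```

Because the identity path is a plain addition, the block can fall back to passing `x` through unchanged if the learned transform is not useful.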

Attention mechanisms for image classification
Because deep CNNs depend on deep or wide architectures, they may have huge computational costs. To address this issue, attention methods were developed [16]. That is, an attention method uses features obtained from some parts of a network as weights acting on other parts to learn more substantial sequential information. Current attention methods can be divided into two kinds: channel attention [17] and spatial attention [8]. Specifically, a channel attention method emphasizes the effects of channel features on the whole CNN. A spatial attention method treats the pixels of all channels at the same location as a whole and learns a weight for each spatial location; all these weights form a spatial attention matrix. The mentioned attention mechanisms solve the problem from different perspectives, which can give readers inspiration.
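The channel attention idea above can be sketched in NumPy in the style of a squeeze-and-excitation module: squeeze each channel to a scalar by global average pooling, pass it through a small bottleneck, and rescale the channels with the resulting weights. The weight shapes and reduction ratio here are illustrative assumptions, not values from any cited method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, w2):
    # feat: (C, H, W) feature map.
    # Squeeze: global average pooling gives one scalar per channel.
    squeeze = feat.mean(axis=(1, 2))          # (C,)
    # Excitation: a small bottleneck (here reduction ratio 4).
    excite = np.maximum(w1 @ squeeze, 0.0)    # (C/4,) with ReLU
    weights = sigmoid(w2 @ excite)            # (C,) weights in (0, 1)
    # Reweight every channel of the feature map.
    return feat * weights[:, None, None]

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))
w2 = rng.standard_normal((8, 2))
out = channel_attention(feat, w1, w2)
```

Because the weights lie in (0, 1), the module can only attenuate channels, letting the network emphasize informative ones relative to the rest.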

Attention-based CNNs for image classification
According to the illustrations in Section 2.1 and Section 2.2, we can see that residual networks and attention methods are effective for image classification. Inspired by that, scholars combined attention mechanisms into residual networks [18]. These methods can be divided into five kinds: single-path attention [18], multi-path attention [19], channel attention [17], spatial attention [8], and the combination of channel attention and spatial attention [20]. Specifically, a single-path attention method [18] mainly embeds an attention module into a single network path. Channel attention [17] enhances the effects of different channels to improve classification results. A spatial attention method [8] uses mean and max pooling operations to extract useful information for image classification, as illustrated in Fig. 1. The combination of channel attention and spatial attention (CBAM) [20] inherits the merits of both spatial and channel attention in image classification.
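The spatial attention step used in CBAM, pooling across channels with mean and max and squashing the result into a per-location weight map, can be sketched as follows. This is a simplified NumPy sketch: the real CBAM concatenates the two pooled maps and passes them through a 7×7 convolution, which is replaced here by a plain sum for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(feat):
    # feat: (C, H, W). Pool across the channel axis with mean and
    # max to summarize each spatial location, then squash into an
    # (H, W) attention matrix that rescales every location.
    avg_map = feat.mean(axis=0)           # (H, W) mean pooling
    max_map = feat.max(axis=0)            # (H, W) max pooling
    attn = sigmoid(avg_map + max_map)     # stand-in for CBAM's 7x7 conv
    return feat * attn[None, :, :]        # broadcast over channels

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 4, 4))
out = spatial_attention(feat)
```

The same (H, W) weight is applied to every channel at a given location, which is what distinguishes spatial attention from the per-channel weighting above.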

Fig. 1 Nine waste bottle images
To test the robustness of the mentioned CNNs with attention for image classification in a specific scene, we use residual networks with attention on a collected waste bottle dataset and evaluate their classification results. The experimental settings are as follows.
Firstly, we crop the given images to a size of 224×224 via random horizontal flipping and scaling operations to accelerate the training of the classifier. Then, the probabilities of ColorJitter, Gaussian blur, and grayscale are set to 0.8, 0.5, and 0.2, respectively, which augments the training dataset. Finally, we normalize the training images via channel means and standard deviations to unify the distribution of training samples. Besides, all the networks are optimized by Adam [23]. The initial learning rate is 0.05, which is varied by a cosine annealing strategy [24] with a minimum learning rate of 1e-4. The batch size is 128 and the number of epochs is 200.
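The cosine annealing schedule above (initial rate 0.05 decayed to a floor of 1e-4 over 200 epochs) can be computed as follows; this is a standalone sketch of the standard formula, not the exact scheduler implementation used in the experiments.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=0.05, lr_min=1e-4):
    # Cosine annealing: the learning rate follows half a cosine
    # cycle, decaying smoothly from lr_max at epoch 0 down to
    # lr_min at the final epoch.
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

# Learning rate at every epoch of the 200-epoch run.
lrs = [cosine_annealing_lr(e, 200) for e in range(201)]
```

Compared with step decay, the smooth cosine curve avoids abrupt learning-rate drops, which is why it pairs well with adaptive optimizers such as Adam here.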

Experimental Results
We discuss the effects of CBAM and CA on ResNet18 [7] at different locations. As shown in Table 1, ResNet18+CBAM and ResNet18+CA outperform ResNet18 on the collected waste bottle image dataset, which shows the effectiveness of CBAM and CA on CNNs for image classification. Besides, ResNet18+CBAM(L=in), ResNet18+CBAM(L=pre), and ResNet18+CBAM(L=post) obtain different accuracies in waste image classification, which illustrates that the importance of CBAM differs across locations. Furthermore, to make a tradeoff between performance and complexity, we reduce the four blocks in ResNet18 (denoted as Net) to discuss the effects of CBAM and CA at different locations on waste bottle image classification. In Table 1, we can see that Net+CA obtains the best performance among the compared methods in terms of accuracy, FLOPs, and parameters. Specifically, ResNet18+CBAM

Potential research points and challenges
Potential research points: 1) fusing the Transformer into CNNs to address image classification; 2) how to use the combination of CNNs and attention mechanisms to handle image classification in complex scenes; 3) how to use CNNs and attention mechanisms to address image classification with insufficient samples. Challenges: 1) how to address the unstable training of the combination of CNNs and attention mechanisms; 2) how to reduce the high complexity and huge computational cost of the Transformer.

Conclusion
In this paper, we have surveyed CNNs with attention mechanisms for image classification. The survey introduces the development of CNNs and the basics of CNNs and attention mechanisms in image classification. Subsequently, we give the main architectures of CNNs with attention, public and collected datasets, and experimental results in image classification. Finally, we point out potential research directions and challenges of attention-based CNNs for image classification and summarize the whole paper.