Visual attentional-driven deep learning method for flower recognition

Abstract: As a typical fine-grained image recognition task, flower category recognition is one of the most popular research topics in computer vision and forestry informatization. Although image recognition methods based on deep convolutional neural networks (DCNNs) have achieved acceptable performance on natural scene images, flower category recognition still suffers from a lack of training samples, high inter-class similarity and low accuracy. In this paper, we study the deep learning-based flower category recognition problem and propose a novel attention-driven deep learning model to solve it. Specifically, since training a deep learning model usually requires massive training samples, we augment the training samples by image rotation and cropping. The augmented images and the original images are merged into one training set. Then, inspired by the mechanism of human visual attention, we propose a visual attention-driven deep residual neural network, which is composed of multiple weighted visual attention learning blocks. Each visual attention learning block is composed of a residual connection and an attention connection, which enhance the learning ability and discriminating ability of the whole network. Finally, the model is trained on the merged training set and used to recognize flowers in the testing set. We verify the performance of the new method on the public Flowers 17 dataset, where it achieves a recognition accuracy of 85.7%.


Introduction
The main purpose of flower recognition is to determine the category of a flower through attributes such as color, texture and semantics, which plays an important role in forestry informatization and plant medicine [1]. Unlike classical image recognition [2][3][4], flower category recognition is a typical fine-grained image recognition task that requires the model to have strong inter-class and intra-class discrimination capabilities. It is also a popular research topic in computer vision, pattern recognition and forestry informatization.
In recent years, deep learning [5] has achieved great success in computer vision, multimedia signal processing and natural language processing [6]. Although there are many classification methods in the literature [7][8][9], the deep convolutional neural network (DCNN), as the most outstanding representative of deep learning, has been widely used in image classification, scene recognition and semantic information extraction, and still holds the current best results [10,11]. In view of this, some researchers have applied DCNNs to flower category recognition and achieved good performance [12,13]. Although DCNN-based methods can improve the accuracy and speed of flower category recognition, three main problems remain: 1) The number of training samples is insufficient. A DCNN contains a large number of parameters, and training deep models on a small dataset is challenging because of over-fitting. Unfortunately, no public dataset for flower category recognition offers both enough categories and enough images per category, which directly limits the performance of the model. Even if the problem can be mitigated by data augmentation or by fine-tuning a model pre-trained on ImageNet, the useful information contained in the dataset is not increased. 2) The recognition accuracy is low. Flower category recognition is a fine-grained image recognition task characterized by high similarity between flowers of different categories. In addition, because of the complexity and variability of the natural environment, the pose and viewing angle of flowers may change unpredictably, which makes the model difficult to train and degrades its performance. 3) The image background is complicated. Flower images collected in nature usually have complex backgrounds and contain a lot of noise, which may limit the recognition performance of deep learning models.
To enable deep learning models to quickly focus on the key parts of the input data, models based on the self-attention mechanism have been developed and successfully applied to many tasks, such as natural language processing and human-machine dialogue [14]. Humans can quickly scan visual information and locate the target area of attention. Moreover, people devote more attention to the target area to obtain more detailed information while suppressing irrelevant information, a survival mechanism formed over a long period of human evolution. Inspired by this, we propose a novel flower recognition method based on the attention mechanism (Visual Attentional-driven DCNNs, VA-DCNNs), which can identify flower species accurately. The method consists of four steps. Firstly, because deep learning methods need massive training data to guarantee performance, we adopt data augmentation to increase the number of samples: we rotate each image clockwise and crop it around the center, and the resulting images are merged with the original samples to form the training set for the experiments. Secondly, a Visual Attentional Learning (VAL) block is constructed for the vanilla DCNNs (we use ResNet14 and ResNet50 as the baselines in this paper), which gives VA-DCNNs strong discriminative learning ability. Thirdly, the layer weights of the model are obtained by training on the dataset. Finally, we measure the recognition accuracy of the model on the testing set. Experiments on the public Flowers 17 dataset demonstrate the effectiveness of VA-DCNNs, which achieve an accuracy of up to 85.7%. Compared with other recognition methods, VA-DCNNs achieve better results and show strong practicability and generalization.

Flowers 17 dataset
The experimental dataset is Flowers 17 [15], which contains 17 flower categories common in the UK, including sunflowers, hyacinths, daffodils and chrysanthemums. Each category has 80 images with different poses, sizes and perspectives. Figure 1 shows some examples from the Flowers 17 dataset. The dataset has been widely used in flower recognition and organ segmentation and is one of the most representative datasets in this field.

Data augmentation
A deep convolutional neural network is usually composed of multiple blocks, and each block contains several convolutional layers, batch normalization layers, activation layers and pooling layers. Data flow forward between blocks through the convolution kernels, and gradients are passed backward by the back-propagation algorithm. A DCNN model contains a large number of parameters to be trained, which allows it to fit the data well. Because such a model needs sufficient training samples to be trained well, augmenting the training set to increase the number of training samples is one of the most common techniques used in deep learning to further enhance the generality and robustness of the model [16]. In this paper, we augment the original flower images by rotation and center cropping. Specifically, we rotate each original 224 × 224 pixel flower image clockwise in steps of 30°, crop the rotated image around its center, and save the result at 224 × 224 pixels. The rotation is performed 4 times (30°, 60°, 90°, 120°), which yields a new dataset five times the size of the original one. The new dataset is randomly divided into a training set (70%), a validation set (20%) and a testing set (10%). Since this data augmentation technique has been widely used in the literature, we do not describe it further here; more details can be found in [11].
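As a rough sketch of the bookkeeping behind this augmentation, the snippet below enumerates the rotated variants and performs a random 70/20/10 split. It is illustrative only: the names `ANGLES`, `build_augmented_index` and `split_dataset`, the fixed seed, and the use of integer ids in place of image files are assumptions, and the rotation and cropping of pixels are not performed.

```python
import random

# Rotation angles stated in the text; 0 stands for the unrotated original.
ANGLES = [0, 30, 60, 90, 120]

def build_augmented_index(num_classes=17, images_per_class=80):
    """Return one (class_id, image_id, angle) entry per augmented sample."""
    return [
        (c, i, a)
        for c in range(num_classes)
        for i in range(images_per_class)
        for a in ANGLES
    ]

def split_dataset(samples, train=0.7, val=0.2, seed=0):
    """Shuffle the samples and split them into train/validation/test parts."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

samples = build_augmented_index()
train_set, val_set, test_set = split_dataset(samples)
print(len(samples), len(train_set), len(val_set), len(test_set))
```

With 17 classes of 80 images and five variants per image, the index holds 6800 entries, which split into 4760, 1360 and 680 samples. Note that a plain shuffle like this one does not enforce equal per-class counts in each subset; a stratified split would be needed for that.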

Attentional-driven residual block
The traditional DCNN module extracts features by stacking convolutional, dropout, batch normalization and activation layers (as shown in Figure 2(a)). Although the effectiveness of this structure has been verified in many DCNN models, simply stacking such blocks easily causes "gradient explosion" or "gradient vanishing" during training as the network depth increases further. Deep blocks cannot receive the input information, or the gradient is lost during back-propagation, so the model cannot be trained [17]. Therefore, the deep residual network (as shown in Figure 2(b)) was proposed in [18] to improve the trainability of DCNNs. This model uses residual connections to link different layers, which allows it to automatically skip some unimportant blocks during training and thus solves some of the problems of traditional DCNNs. To make the DCNN quickly locate the focal area of the image, and inspired by the human visual mechanism, we propose a Visual Attentional Learning (VAL) block (shown in Figure 2(c)) based on the attention mechanism.
Specifically, we obtain the weights of the channels and spatial positions of the convolutional features by performing batch normalization on the feature map. The input is a C-channel feature map of height H and width W, taken from the last convolutional layer in each stage; φ(·) denotes the batch normalization function, and its output is the learned feature weight. In the definition of φ(·), (i, j) denotes a location in the feature map, c is the channel index of the feature map, and σ(·) is the sigmoid function. The function φ(·) takes one of three forms: a spatial attention-driven block, which learns feature weights over spatial positions; a channel attention-driven block, which learns feature weights over the channel dimension; and a combined attention learning block, which merges the two and therefore considers both spatial location information and channel information.
To retain the advantages of the residual technique, we weight the convolutional features with the attention output and add the result to the output of the residual connection F(·), which gives the final output of the block. With this block, the model can not only actively skip some unimportant features during training, but also quickly locate important channels and spatial positions through the attention mechanism. Therefore, the model can effectively alleviate the problems of insufficient training samples and small differences between samples in the flower recognition task.
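The snippet below is a minimal NumPy sketch of one possible reading of such a block, not the authors' implementation. It assumes that the attention weights are the sigmoid of normalized activations, that channel attention is computed from per-channel means, that spatial attention is computed from the channel-averaged map, that the combined weight is their product, and that the block output is the attention-weighted features plus the residual branch F(x). The function names `normalize`, `channel_attention`, `spatial_attention` and `val_block`, and the identity stand-in for F, are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def normalize(z, eps=1e-5):
    """Zero-mean, unit-variance normalization (a stand-in for batch norm)."""
    return (z - z.mean()) / np.sqrt(z.var() + eps)

def channel_attention(x):
    """x: (C, H, W) -> weights of shape (C, 1, 1), each in (0, 1)."""
    per_channel = x.mean(axis=(1, 2))  # one statistic per channel
    return sigmoid(normalize(per_channel))[:, None, None]

def spatial_attention(x):
    """x: (C, H, W) -> weights of shape (1, H, W), each in (0, 1)."""
    per_position = x.mean(axis=0)  # one statistic per spatial location
    return sigmoid(normalize(per_position))[None, :, :]

def val_block(x, residual_fn=lambda t: t):
    """Attention-weighted features plus the residual branch F(x).

    residual_fn defaults to the identity purely for illustration.
    """
    weights = channel_attention(x) * spatial_attention(x)  # broadcasts to (C, H, W)
    return weights * x + residual_fn(x)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))  # toy feature map: C=8, H=W=4
y = val_block(x)
print(y.shape)
```

Because the weights lie strictly between 0 and 1 and the residual branch is added afterwards, the attention path in this sketch can only rescale features on top of what the residual path already passes through, which mirrors the argument that the block should not hurt the baseline.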

Attentional-driven residual network
DCNN models of any depth can be built from the Attentional-driven residual block, but considering the local hardware and the scale of the dataset, we adopt ResNet14 and ResNet50 as the basic frameworks and construct their Attentional-driven residual versions. The network structure is shown in Table 1. We name the two resulting models VA-ResNet14 and VA-ResNet50, respectively. The input of both models is a 224 × 224 × 3 color JPG image, which is fed to the first deep learning block (convolutional layer 1+), consisting of a 7 × 7 convolutional layer, a batch normalization layer, an activation layer and a max pooling layer. Then, we add the Attentional-driven residual block at the last layer of the second (convolutional layer 2+), third (convolutional layer 3+), fourth (convolutional layer 4+) and fifth (convolutional layer 5+) stages. In addition, we retain the residual connection structure in the model (as shown in Figure 2(c)). Finally, the model performs flower classification through global average pooling and a fully connected layer. Apart from the added attention blocks, the improved model is structurally identical to the original residual network. Since the attentional-driven block has few parameters, the improved network does not noticeably increase the training burden. In addition, combining attentional-driven learning with the residual connection ensures that the performance of the model does not degrade: even in the worst case, the residual connection can bypass the attention learning block, effectively disabling it.

Experiment setting
We use the Flowers 17 dataset to evaluate the performance of the Attentional-driven residual network proposed above. By randomly dividing the augmented dataset according to the stated proportions, we obtain 4760 images in the training set (280 images per class), 1360 images in the validation set (80 images per class) and 680 images in the testing set (40 images per class). All of the flower images are two-dimensional color images in JPG format. The input data are normalized by subtracting the mean value. Training uses the Stochastic Gradient Descent (SGD) algorithm [15] to optimize the hinge loss function. The batch size is set to 128. The learning rate starts at 0.01 and is reduced to 1/10 of its value every 10,000 iterations; training stops at 50,000 iterations. The weight decay parameter is 0.0005. The experiments are implemented in PyTorch, based on the Python programming language. As one of the most widely used deep learning frameworks, PyTorch offers good scalability, modularity and efficiency, and is very popular in both academia and industry [19]. We run all the algorithms on a ThinkStation P320 workstation with 4 GTX 1080 Ti GPUs to speed up image processing [20].
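The learning-rate schedule and the loss described above can be sketched in plain Python. This is an illustration under assumptions: the function names `learning_rate` and `multiclass_hinge_loss` are hypothetical, the hinge loss is written in a common multi-class margin form with margin 1 (the text does not specify the exact variant), and no actual training is performed.

```python
def learning_rate(iteration, base_lr=0.01, step=10_000, gamma=0.1):
    """Step decay: multiply the rate by gamma every `step` iterations."""
    return base_lr * gamma ** (iteration // step)

def multiclass_hinge_loss(scores, target, margin=1.0):
    """Sum of margin violations over the non-target classes,
    divided by the number of classes, for a single sample."""
    violations = [
        max(0.0, margin - scores[target] + s)
        for k, s in enumerate(scores)
        if k != target
    ]
    return sum(violations) / len(scores)

MAX_ITERATIONS = 50_000  # training stops here according to the text

# Print the learning rate at the start of each 10,000-iteration phase.
for it in range(0, MAX_ITERATIONS, 10_000):
    print(it, learning_rate(it))

# Toy example: class 0 is the target; class 2 violates the margin by 0.5.
print(multiclass_hinge_loss([2.0, 0.5, 1.5], target=0))
```

Under this schedule the rate takes five values over the run, starting at 0.01 and shrinking by a factor of ten at iterations 10,000, 20,000, 30,000 and 40,000.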

Results analysis
Based on the experiment setting described above, we train VA-ResNet14 and VA-ResNet50, respectively. Figure 3 shows the accuracy and objective loss curves on the validation set, where the blue curve shows the result of VA-ResNet14 and the green curve shows the result of VA-ResNet50. The curves flatten after about 40,000 iterations, indicating that the algorithm has converged. In addition, VA-ResNet50 reaches higher accuracy and lower objective loss than VA-ResNet14, which means that VA-ResNet50 fits the data better. Besides, the curves of VA-ResNet14 fluctuate more than those of VA-ResNet50 in Figure 3, because VA-ResNet14 has fewer parameters. Given the same input data, a model with fewer parameters has more difficulty finding a local optimal solution [21].
Figure 3. (a) Accuracy on the validation set; (b) objective loss on the validation set.
To show the effect of the Attentional-driven residual block, we provide the focus areas of flower images obtained by attention learning in the first layer. As shown in Figure 4, the bright areas are those the model focuses on. We can make two observations: 1) The focus area of attention is not continuous, but scattered into several bright spots. The brighter the area, the greater the role it plays in classification and the higher the weight it corresponds to. This indicates that not all parts of a flower play an important role in the flower recognition task. 2) Compared with the original input image, the focus areas always correspond to the more colorful parts of the flower, indicating that color information is the key to discriminating flowers. Besides, since the attention mechanism puts more effort on the flower, the noise in the background does not affect the flower recognition task, which makes the model more robust.
Further, we visualize the convolutional features from the first to the third layers for some flowers. Figure 5 shows the feature visualization results for sunflower, snowdrop and tiger lily, respectively. From Figure 5 we can draw the following conclusions. Firstly, the convolutional features learned by the shallow layers are mainly interpretable features such as texture and color, while the features learned by the deep layers are more abstract, such as outline or shape. Secondly, the features from shallow layers usually carry high-resolution information, while the deep layers are more likely to extract semantic information. Therefore, the resolution of the feature maps gradually decreases as the layers get deeper. In the classification process, the semantic information determines "what" the image is, while the shallow features determine "where" the discriminative information lies in the image.

Method comparison
To further verify the effectiveness of the methods proposed in this paper, we compare them with some popular image classification techniques. We keep all the parameters consistent with the original papers so that each algorithm runs in its intended configuration. The results on the testing set are shown in Table 2. Compared with VGGNet [22], Network In Network [23], GoogLeNet [24] and Inception V3 [25], the method proposed in this paper achieves higher accuracy. Specifically, VA-ResNet14 and VA-ResNet50 improve accuracy by 1.7% and 3.6% over ResNet14 [18] and ResNet50 [18], respectively. This indicates that the proposed method has good generality: the VA block can still improve model performance, even on a DCNN model with strong representation ability. It can also be seen that ResNet with VA blocks achieves considerably higher accuracy than VGGNet [22], Network In Network [23], GoogLeNet [24] and Inception V3 [25].

Conclusions
In this paper, we propose a novel Attentional-driven residual network model for flower recognition. By adding an attention connection to each residual block, the model can learn from different channel features and different spatial dimensions while retaining the ability to learn from few samples, which compensates for the shortage of training samples. To verify the feasibility and effectiveness of the proposed methods, we conduct experiments on the Flowers 17 dataset. The experiments show that our method achieves an accuracy of 85.7%, which is higher than that of existing image classification methods, while introducing few additional trainable parameters. Although the proposed methods were initially designed for flower recognition, they have strong scalability and practicability and can easily be applied to other object recognition tasks, such as terrain recognition, farmland recognition and forest recognition in remote sensing images.
Our future work will focus on the following aspects. 1) Expanding the flower database: we will increase both the number of flower varieties and the number of images per species. 2) Since our methods are supervised learning models that need a large amount of labeled data for training, one of our future projects is to combine them with more advanced techniques, such as semi-supervised learning and one-shot or few-shot learning. 3) Another project will focus on transfer learning and data generation techniques based on natural image datasets to improve the generalization ability and robustness of the model.