Fine-Grained Visual Classification with Aggregated Object Localization and Salient Feature Suppression

Fine-grained visual classification (FGVC) aims to classify sub-classes of objects within the same super-class. FGVC tasks require finding subtle yet discriminative information in local regions. However, traditional FGVC approaches tend to extract strongly discriminative features and overlook some subtle yet useful ones. Besides, current methods ignore the influence of background noise on feature extraction. Therefore, we propose aggregated object localization combined with salient feature suppression, organized as a stacked network. First, the feature maps extracted by the coarse network are fed into aggregated object localization to obtain the complete foreground object in an image. Second, the refined features obtained by zooming in on the complete foreground object are fed into the fine network. Finally, after finer network processing, the feature maps are fed into the salient feature suppression module to find more valuable discriminative regional features for classification. Experimental results on two datasets show that our proposed method achieves superior results compared with state-of-the-art methods.


Introduction
Different from general image classification, fine-grained visual classification (FGVC) aims to differentiate subcategories of a super-category (e.g., species of birds). The task is much more difficult than coarse-grained image classification, since visual differences between subordinate classes are often subtle and deeply hidden in local discriminative regions. As a consequence, FGVC tasks face two challenges: a) large intra-class variance and small inter-class variance; b) images with quite complex backgrounds, often accompanied by occlusions and scale variation. Hence, FGVC has constantly been a burdensome task.
Early works [1] generally rely on extra information (e.g., bounding box/part annotations) to obtain accurate positions of distinguishing regions and learn sophisticated feature representations. Although decent results have been reported, such supervised learning is not the best choice for FGVC, as manual annotations are labor-intensive and often error-prone. Subsequently, weakly supervised FGVC methods [2,3] have been proposed that use only class labels.
Some models improve classification performance by encoding higher-order information. For example, B-CNN [4] makes full use of second-order encoding to dig out more feature information, but the outer product is high-dimensional, which may lead to running out of memory. Other methods find discriminative regions using attention mechanisms. RA-CNN [5] extends attention to discriminative areas via an attention network, but it learns only a single discriminative feature during the training stage. What's more, the features of the object of interest are easily overshadowed by background noise. We argue that accurate localization of the complete foreground object and sophisticated feature representation correlate in a mutually reinforced way. Based on this assumption, we propose aggregated object localization combined with salient feature suppression to handle the above problems of fine-grained image classification. First, complete foreground object features are obtained through aggregated object localization, which effectively eliminates background noise. Then, given the complete foreground object features, salient feature suppression weakens salient features to force the network to pay more attention to subtle ones. Finally, a three-branch loss function constrains the learning process.
Our main contributions are as follows: (1) We propose a stacked network, which effectively locates the object of interest and learns sophisticated feature representations, the two being correlated in a mutually reinforced way.
(2) An aggregated object localization module is designed, which utilizes multi-layer activation maps to generate an object mask. The edge of the object mask precisely locates the complete foreground object in the image.
(3) Given a complete foreground object, we weaken the high-response features through salient feature suppression. Hence, the model can learn more useful features and enhance its robustness.

Approach
Our method is a stacked network using ResNet-50 [6] as the backbone. The proposed structure is shown in Figure 1.

Aggregated object localization
Our complete object localization approach is inspired by SCDA [7], which generates a binary matrix to measure the distribution of object response values. Let F denote the last convolutional feature map of ResNet-50. The activation map A is generated by adding up the feature maps F along the channel direction. We then set a threshold τ, which is the mean value of A: if the value at position (i, j) of A is greater than τ, the mask value M_{i,j} is set to 1, otherwise it is set to 0, as expressed in Eq. (1). The distribution of the object can be determined by the intensity of each pixel in the activation map, so activation maps of different layers reflect different distributions of the image. In our method, the multi-layer activation maps are fused to obtain an aggregated mask M, which enhances object localization performance. The larger the aggregated mask value, the more likely the position belongs to the object; accordingly, the aggregated mask value equals 1 only when the sum of all the corresponding multi-layer mask values is 3, and 0 otherwise. Finally, since the object usually lies in the largest connected component of M, we take the smallest bounding box containing the largest connected area as the object localization result.
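The mask generation and fusion described above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the helper names `layer_mask` and `aggregated_mask` are ours, and the final step of extracting the largest connected component (e.g., with `scipy.ndimage.label`) before fitting the bounding box is omitted.

```python
import torch
import torch.nn.functional as F


def layer_mask(feat: torch.Tensor) -> torch.Tensor:
    """Binarize one layer's activation map with its mean as threshold (Eq. 1).

    feat: (C, H, W) feature map from one convolutional stage.
    Returns an (H, W) binary mask.
    """
    act = feat.sum(dim=0)   # activation map A: sum over the channel direction
    tau = act.mean()        # threshold τ = mean value of A
    return (act > tau).float()


def aggregated_mask(feats) -> torch.Tensor:
    """Fuse per-layer masks: a pixel is foreground only when every layer
    mask agrees, i.e., the sum of mask values equals the number of layers."""
    h, w = feats[-1].shape[-2:]
    masks = [
        F.interpolate(layer_mask(f)[None, None], size=(h, w),
                      mode="nearest")[0, 0]
        for f in feats  # resize earlier-layer masks to a common resolution
    ]
    total = torch.stack(masks).sum(dim=0)
    return (total == len(feats)).float()
```

With the paper's three layers, `total == 3` at a pixel means all three activation maps agree that the pixel belongs to the object.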

Salient feature suppression
We obtain the complete foreground object via the multi-layer aggregated mask, which eliminates irrelevant background information. To achieve optimal feature learning based on the complete object, we design the salient feature suppression module, shown in Figure 2. Given a feature map F, we apply channel-wise average pooling and upsample the result to the same size as the original image. A sigmoid operation is then performed on the upsampled feature map, yielding the attention map A, as expressed in Eq. (2). In A, high-value regions correspond to discriminative features, while low-value regions are subtle areas to which the model pays less attention. Therefore, we strengthen the lower-response values in the attention map A, so that the network attaches importance to the features of subtle regions rather than only to the discriminative ones. By doing this, all subtle yet discriminative features in the aggregated foreground object can be fully learned by the model. Concretely, the suppression value at position (i, j) is α if A_{i,j} is greater than the threshold θ, and 1 otherwise. Instead of directly erasing the discriminative features, we replace them with the small value α, so the model can mine more subtle yet discriminative features in the complete object image, as expressed in Eq. (4).
Finally, the aggregated foreground object I is multiplied element-wise by the suppression map.
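The suppression step can be sketched as follows. This is a hypothetical implementation based on the description above: the function name `suppress_salient` and the exact pooling/upsampling calls are our assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F


def suppress_salient(image: torch.Tensor, feat: torch.Tensor,
                     theta: float = 0.5, alpha: float = 0.5) -> torch.Tensor:
    """Salient feature suppression sketch.

    image: (3, H, W) aggregated foreground object image I.
    feat:  (C, h, w) feature map from the fine branch.
    Positions whose attention value exceeds theta are damped to alpha
    instead of being erased; all other positions are kept unchanged.
    """
    attn = feat.mean(dim=0, keepdim=True)           # channel-wise average pooling
    attn = F.interpolate(attn[None], size=image.shape[-2:],
                         mode="bilinear", align_corners=False)[0]
    attn = torch.sigmoid(attn)                      # attention map A (Eq. 2)
    s = torch.where(attn > theta,
                    torch.full_like(attn, alpha),   # damp salient regions to α
                    torch.ones_like(attn))          # keep subtle regions as-is
    return image * s                                # element-wise suppression
```

Replacing high-response regions with a small constant rather than zero keeps some gradient signal flowing through them while still forcing the network toward subtler cues.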

Classifier
We construct a three-branch network; each branch has a classifier that constrains its learning convergence process. Cross-entropy loss is used throughout training, and the total loss is the sum of the three branch losses.
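A minimal sketch of the three-branch objective (the function name `total_loss` is ours; each argument is the logits from one branch):

```python
import torch
import torch.nn.functional as F


def total_loss(logits_coarse: torch.Tensor,
               logits_fine: torch.Tensor,
               logits_suppress: torch.Tensor,
               labels: torch.Tensor) -> torch.Tensor:
    """Total loss: sum of the cross-entropy losses of the three branches."""
    return (F.cross_entropy(logits_coarse, labels)
            + F.cross_entropy(logits_fine, labels)
            + F.cross_entropy(logits_suppress, labels))
```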

Experimental Details
We conducted experiments with PyTorch on a GTX 1080Ti GPU. During the training stage, images were resized to 448×448, and the inputs of the three-branch network were enlarged to the same size. We used SGD as the optimizer and BN for regularization, with momentum set to 0.9 and weight decay set to 0.001. The number of epochs was 120, and the batch size was 6. The learning rate was 0.001 and was multiplied by 0.1 at epoch 60 [10]. After experimental analysis, we set θ=0.5, α=0.5 for the CUB-200-2011 dataset, and θ=0.6, α=0.7 for Stanford Cars.
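The optimizer settings above correspond to roughly the following PyTorch configuration (a sketch only; the `Linear` model is a stand-in for the actual network, and `StepLR` with `step_size=60` is one way to realize the decay at epoch 60):

```python
import torch

model = torch.nn.Linear(10, 200)  # stand-in for the full three-branch network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.001)
# multiply the learning rate by 0.1 every 60 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
```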

Performance Comparison
To demonstrate the effectiveness of our method, we conduct experiments on the CUB-200-2011 [8] and Stanford Cars [9] datasets. The comparison results are shown in Table 1.

Ablation Studies
The effect of aggregated object localization: As shown in Table 2, when aggregated object localization is introduced into ResNet-50, the model achieves 85.9% accuracy on the CUB-200-2011 dataset, which is 1.8% higher than the ResNet-50 baseline. This demonstrates that the component is effective.
The effect of salient feature suppression: As shown in Table 2, when salient feature suppression is introduced, our approach obtains 87.8% accuracy, an improvement of 3.7% over the baseline, demonstrating that this component is effective as well. Further results are reported in Table 3. Besides, to further verify the effectiveness of our method, we visualize the bounding boxes, where the red rectangles represent the ground truth and the green rectangles are the predictions, as shown in Figure 3. The top row is SCDA and the bottom row is our aggregated object localization method. Our method precisely locates the object of interest.

Conclusions
In this paper, aggregated object localization combined with salient feature suppression is proposed, which extracts pure features of the complete foreground object and strengthens subtle feature learning, the two being correlated in a mutually reinforced way. Extensive experiments confirm that the proposed method improves accuracy on different FGVC datasets.