A novel CNN model for fine-grained classification with large spatial variants

Convolutional Neural Networks (CNNs) have achieved strong performance on many visual tasks. However, CNN models are sensitive to samples with large spatial variance, a problem that is especially severe in fine-grained classification. In this paper, we propose a novel CNN model called ST-BCNN to address these problems. ST-BCNN contains two functional CNN modules: a Spatial Transformer Network (STN) and a Bilinear CNN (BCNN). First, the STN module selects the key region of the input sample and spatially transforms it. Since adopting an STN can cause an information-loss phenomenon called boundary loss, we design a new IOU loss to address it and give a theoretical analysis of the method. Second, to discover discriminative features for fine-grained classification, the BCNN module is applied. BCNN interacts CNN features from different channels to produce bilinear features that are more discriminative than the fully connected features of a CNN. ST-BCNN works by reducing irrelevant spatial states and producing fine-grained features. We evaluate our model on three fine-grained classification datasets with large spatial variance: CUB200-2011, Fish100 and UAV43. Experiments show that the IOU loss reduces boundary loss and makes the STN module output appropriately transformed images. Our proposed ST-BCNN model outperforms other advanced CNN models on all three datasets.


Introduction
Over recent years, the study of computer vision tasks has been pushed forward by the development of deep learning algorithms. Convolutional Neural Networks (CNNs) are a primary deep learning algorithm [1]. By connecting multiple learnable convolutional and fully connected layers, they can extract more representative features than traditional hand-designed features. Despite this progress, CNNs still have drawbacks: they are sensitive to samples with large spatial variance, which is especially severe in fine-grained classification.
ICSP 2020, Journal of Physics: Conference Series 1544 (2020) 012138, IOP Publishing, doi:10.1088/1742-6596/1544/1/012138

The Spatial Transformer Network (STN) [2] can be inserted as a module into any large network and allows end-to-end training. However, due to the end-to-end training strategy, it sometimes fails to converge and suffers from boundary loss.
A modified architecture is the Inverse Compositional Spatial Transformer Network (IC-STN) [3]. Unlike the traditional STN, it uses cascaded STNs to predict the transformation: each STN takes the output of the previous STN as input, and the final transformation is the composition (matrix product) of all intermediate STN transformations. This network improves classification performance but still suffers from boundary loss.

Fine-grained classification task
Many CNN studies of classification focus on coarse-grained classification, but fine-grained classification is more challenging. Fine-grained classification means classifying samples of the same class into different subclasses. The samples may be very similar and differ only in subtle parts, while samples in the same subclass may look very different due to spatial variance. In short, fine-grained samples have small inter-class variance and large intra-class variance.
There have been several studies of fine-grained classification, which can be divided into two types: strongly-supervised and weakly-supervised methods. Strongly-supervised methods need manual object-part annotations. Since samples differ only in subtle parts, comparing parts between samples rather than whole images is realistic [4]. Major methods include part-based R-CNN [5], Mask-CNN [6] and Pose Normalized CNN [7]. Although they perform better than weakly-supervised methods, annotating object parts is time-consuming.
Weakly-supervised methods can be trained on class labels alone, without extra annotations. There are two main types: attention-based methods and bilinear-pooling methods. With an attention mechanism, a weakly-supervised method can automatically detect important object parts. Recurrent Attention CNN [8] uses an Attention Proposal Network (APN) module to zoom in on key parts recurrently. Multi-Attention CNN (MA-CNN) [9] divides the last convolutional layers into groups, each group representing a part attention. Other important works include MAMC [10], RAM [11] and RAN [12]. Bilinear pooling [13][14] interacts convolutional features across different channels. It exploits the relationships between convolutional features, since features from different convolutional channels are very rich [15][16]. It also has variants such as compact bilinear pooling and hierarchical bilinear pooling. However, these weakly-supervised methods do not address the problem of spatial variance, and their performance degrades as spatial variance increases.
In this paper, we aim to solve the problems above. By introducing a novel IOU loss, we solve the boundary-loss problem of STN and make its behavior more stable. By combining STN and BCNN, we design a novel network, ST-BCNN, for fine-grained classification with large spatial variance. It works by comparing samples under similar spatial states and mining fine-grained features, and it outperforms other advanced CNN models on three datasets.

Spatial transformer network (STN)
STN contains a localization network, which takes the input feature map and outputs a 6-dimensional vector that is reshaped into the transformation matrix A_θ. The localization network can have any structure as long as its output is a 6-dimensional vector; a common choice is a CNN with several convolutional and fully connected layers.
Suppose the input feature map I ∈ R^{H×W×C} has height H, width W and C channels. The output feature map O ∈ R^{H'×W'×C} has the same number of channels C but a different height H' and width W'. Each output pixel coordinate (x_i^t, y_i^t) is mapped to a source coordinate by the affine transformation (1):

(x_i^s, y_i^s)^T = A_θ (x_i^t, y_i^t, 1)^T  (1)

A sampling kernel is then used to get the value at each particular pixel of the output O [2]. The form of the sampling kernel is (2):

O_i^c = Σ_{n=1}^{H} Σ_{m=1}^{W} I_{nm}^c k(x_i^s − m; Φ_x) k(y_i^s − n; Φ_y)  (2)

Since the coordinates of output-feature pixels must be integers, each value is approximated from its nearby transformed points.
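As a concrete illustration, the affine mapping (1) and a bilinear instance of the sampling kernel (2) can be sketched in NumPy. This is a minimal sketch under our own conventions (normalized coordinates in [−1, 1], a single-channel image, function names of our choosing), not the paper's implementation:

```python
import numpy as np

def affine_grid(theta, h_out, w_out):
    """Map each output pixel (x_t, y_t) in [-1, 1] to source
    coordinates (x_s, y_s) via the 2x3 matrix A_theta, as in (1)."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h_out),
                         np.linspace(-1, 1, w_out), indexing="ij")
    grid = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H', W', 3)
    return grid @ theta.T                                 # (H', W', 2)

def bilinear_sample(img, grid):
    """Sample a (H, W) image at real-valued source coordinates, the
    bilinear instance of sampling kernel (2)."""
    h, w = img.shape
    # convert normalized source coordinates to pixel coordinates
    x = (grid[..., 0] + 1) * (w - 1) / 2
    y = (grid[..., 1] + 1) * (h - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, w - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    wx, wy = x - x0, y - y0
    return (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x1] * wx * (1 - wy)
            + img[y1, x0] * (1 - wx) * wy + img[y1, x1] * wx * wy)

# sanity check: the identity transform reproduces the input image
theta = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
img = np.arange(16, dtype=float).reshape(4, 4)
out = bilinear_sample(img, affine_grid(theta, 4, 4))
```

In an actual STN the six entries of `theta` come from the localization network, and the whole pipeline is differentiable so that gradients flow back into the localization parameters.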

Boundary loss.
STN is trained in an end-to-end framework, so the training process is often unstable. The outputs of STN suffer from a problem called boundary loss (see Figure 2): the output image samples only from the cropped region, and pixel information outside that region is discarded, leaving the boundary of the output image empty. A CNN cannot handle the empty boundary well, and boundary loss sometimes makes a CNN with an STN module perform even worse than the original CNN.

Bilinear CNN (BCNN)
A bilinear model extracts two convolutional feature maps f_A(l, I) and f_B(l, I) from image I at each location l in a location set L; L is defined generally and can include position and scale [13]. The bilinear feature combination at each location l is the matrix outer product, as in (3):

bilinear(l, I, f_A, f_B) = f_A(l, I)^T f_B(l, I)  (3)

The bilinear image descriptor Φ(I) is obtained by pooling the bilinear features over all locations; a common choice is sum pooling over all locations.
The descriptor is then passed through a signed square-root step, y = sign(Φ(I))·√|Φ(I)|, followed by l2 normalization. Both the bilinear feature combination and the sum pooling are differentiable, so the bilinear CNN model can be optimized with end-to-end training.
Structures of the BCNN model can be divided into fully shared, partially shared and non-shared [14]. In the fully shared model (Figure 4(a)), only one CNN is used and f_A(l, I) and f_B(l, I) are the same feature mapping. In the non-shared model (Figure 4(b)), two different CNNs extract the features. In the partially shared model (Figure 4(c)), part of the CNN is shared.
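The bilinear combination (3), sum pooling, signed square root and l2 normalization can be sketched as follows. This is a minimal NumPy sketch with made-up feature shapes, shown in the fully shared setting where the same feature map is passed on both sides:

```python
import numpy as np

def bilinear_descriptor(feat_a, feat_b):
    """feat_a: (c_a, h, w), feat_b: (c_b, h, w) conv features of one image.
    Sum-pool the outer product f_A(l)^T f_B(l) over all locations l,
    then apply signed square root and l2 normalization."""
    c_a, h, w = feat_a.shape
    c_b = feat_b.shape[0]
    a = feat_a.reshape(c_a, h * w)           # columns are locations l
    b = feat_b.reshape(c_b, h * w)
    phi = a @ b.T                            # (c_a, c_b): sum over locations
    y = np.sign(phi) * np.sqrt(np.abs(phi))  # signed square root
    y = y.flatten()
    return y / (np.linalg.norm(y) + 1e-12)   # l2 normalization

# fully shared BCNN: the same feature map on both sides;
# with c = 8 channels the bilinear feature has c^2 = 64 dimensions
feat = np.random.rand(8, 4, 4)
desc = bilinear_descriptor(feat, feat)
```

Every step here is differentiable, which is what allows the descriptor to sit inside an end-to-end trained network.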

IOU loss
Intersection over Union (IOU) measures the intersection area formed by the original image (Figure 5(a)) and the transformed image (Figure 5(b)). More information remains after the transformation when the IOU value is higher. The green part in Figure 5(c) is the IOU area: the larger it is, the less information is lost. The maximum IOU value is 1.0 (Figure 5(d)); in this case there is no boundary loss.

Figure 5. IOU area

However, the transformed image is usually an irregular quadrilateral, which makes the IOU value hard to calculate exactly, so we introduce an approximate calculation method. If more points on the boundary of the transformed image fall outside the area of the original image, the IOU value is usually higher. We therefore inspect the positions of these boundary points.
Since there are infinitely many points on the boundary, we select eight key points on the boundary of the output image according to the following principle (see Figure 6). Each key point (x_i^t, y_i^t) is mapped by the transformation to a point (x_i^s, y_i^s) in the original image, with coordinates normalized to [−1, 1]. If (x_i^s, y_i^s) is outside the area of the original image, at least one of |x_i^s| and |y_i^s| will be greater than 1. However, if this value becomes arbitrarily large, the transformed image is amplified too much. Considering these factors, an IOU loss is established as (5):

L_IOU = Σ_i ReLU(1 − ‖(x_i^s, y_i^s)‖_∞)  (5)

Figure 6. Eight key points

Inspired by the ReLU loss [21], the loss is greater than 0 whenever the infinity norm is less than 1. But the infinity norm is non-differentiable and cannot be optimized by back-propagation, so we modify the loss into (6):

L_IOU = Σ_i [ReLU(1 − |x_i^s|) + ReLU(1 − |y_i^s|)]  (6)

The single-point form of the IOU loss can be expanded with the original point coordinates. Writing the transformation matrix as A_θ = [[a_11, a_12, t_1], [a_21, a_22, t_2]], we obtain (7):

L_i = ReLU(1 − |a_11 x_i^t + a_12 y_i^t + t_1|) + ReLU(1 − |a_21 x_i^t + a_22 y_i^t + t_2|)  (7)

Its derivatives are (8) and (9): when |a_11 x_i^t + a_12 y_i^t + t_1| < 1,

∂L_i/∂a_11 = −sign(a_11 x_i^t + a_12 y_i^t + t_1) · x_i^t  (8)
∂L_i/∂t_1 = −sign(a_11 x_i^t + a_12 y_i^t + t_1)  (9)

and both are 0 otherwise. The derivatives with respect to a_12, a_21 and a_22 have the same form as that with respect to a_11, and the derivative with respect to t_2 has the same form as that with respect to t_1. Since the IOU loss is differentiable with respect to all six parameters, it can be used to train the CNN with back-propagation.

Motivation of ST-BCNN

Figure 7. Structure of ST-BCNN

Spatial variance makes a CNN produce many redundant features, which expands the feature space; if we can reduce spatial variance, the irrelevant subspace shrinks. We therefore want to transform each sample before it is fed into the CNN. Features from different CNN channels are correlated in some dimensions, and bilinear pooling helps discover the relationship between them: it can be viewed as a coupled feature-transformation method that produces more discriminative features for fine-grained tasks.
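The IOU loss of the form (6)/(7) can be sketched numerically. This is a minimal NumPy sketch; the choice of the eight key points (corners plus edge midpoints of the output boundary) is our assumption, since Figure 6 is not reproduced here:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def iou_loss(theta):
    """Differentiable surrogate IOU loss: penalize boundary key points
    whose transformed coordinates fall strictly inside (-1, 1), i.e.
    the sampled region does not reach the original image border."""
    # assumed key points on the output boundary, normalized to [-1, 1]:
    # four corners and four edge midpoints
    pts = np.array([[-1, -1], [0, -1], [1, -1], [1, 0],
                    [1, 1], [0, 1], [-1, 1], [-1, 0]], dtype=float)
    (a11, a12, t1), (a21, a22, t2) = theta
    x_t, y_t = pts[:, 0], pts[:, 1]
    x_s = a11 * x_t + a12 * y_t + t1   # mapped source coordinates
    y_s = a21 * x_t + a22 * y_t + t2
    return np.sum(relu(1 - np.abs(x_s)) + relu(1 - np.abs(y_s)))

identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
zoom_in = np.array([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]])
# an aggressive zoom-in samples only the central crop, pulling the key
# points inside the original image and raising the loss relative to
# the identity transform
```

In training, this term would be added to the classification loss so that gradients through (8) and (9) discourage transformations that discard boundary information.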

Details of ST-BCNN
Inspired by the ideas above, we combine two functional CNN modules, the spatial transformer network and the bilinear CNN, into a novel network called ST-BCNN. First, the input image is transformed by an STN module, so the image is spatially modified and the key part is focused on. Then the transformed image is processed by a CNN module. Finally, the output convolutional features of the CNN module are processed by bilinear pooling, and a softmax layer predicts the class (see Figure 7).
It is worth mentioning that ST-BCNN adopts the fully shared BCNN: only one CNN is used to extract features, which costs about half the operation time of the non-shared BCNN. In the bilinear pooling operation, only the last convolutional layer of the CNN is used, and features from different channels interact through an outer product. If there are c channels, the bilinear feature has c^2 dimensions.
During training, the total loss is composed of two parts, as in (10). The first part is the cross-entropy loss, which measures classification accuracy; the second is the IOU loss, which reduces information loss in the spatial transformation. A parameter α balances the two losses:

min(Loss) = min(L_cross_entropy + α·L_IOU)  (10)

To train the model, we use a two-stage method. First, the STN module is combined with the CNN module and only these two modules are trained. Second, the fully connected layers of the CNN are replaced with bilinear pooling layers, and we fine-tune the parameters of the whole model.
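The combined objective (10) can be sketched as follows, with hypothetical values and α = 5.0 (the value the parameter-selection experiment below finds best):

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy of the softmax output for the true class."""
    return -np.log(probs[label] + 1e-12)

def total_loss(probs, label, iou_loss_value, alpha=5.0):
    """Combined objective (10): L = L_cross_entropy + alpha * L_IOU."""
    return cross_entropy(probs, label) + alpha * iou_loss_value

# hypothetical softmax output over 3 classes and a hypothetical IOU loss
probs = np.array([0.1, 0.7, 0.2])
loss = total_loss(probs, label=1, iou_loss_value=0.3)
```

Because both terms are differentiable, a single back-propagation pass updates the localization network and the classifier jointly.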

Baseline: Inception Resnet V2
We select Inception Resnet V2 [17] as our baseline CNN module. It combines residual connections with inception modules: on one hand, residual connections allow a deeper CNN; on the other hand, inception modules use multi-scale receptive fields to produce multi-scale features. It performs better on the ImageNet dataset than other well-known CNN structures such as Resnet [19] and Inception-v4 [18].

Experiments
Our STN architecture is relatively simple: it takes the original image as input, consists of 2 convolutional layers, 3 max-pooling layers and 2 fully connected layers, and outputs a 6-dimensional vector whose elements are the parameters of the transformation matrix. We evaluate on the CUB200-2011 [20] and Fish100 [23] datasets and on an Unmanned Aerial Vehicle (UAV) dataset established by ourselves. All three are fine-grained classification datasets, and their samples all have large spatial variance. We split 80% of the images of each dataset into the training set; the rest is for testing.

Parameter selection
The parameter α in the loss affects STN accuracy. We conduct an experiment on the CUB200-2011 dataset to find its optimal value.
The experimental results are in Table 2. The accuracy of our baseline Inception Resnet V2 is 78.17%. If the IOU loss is not applied, adding an STN module even worsens the accuracy, to 66.17%. With the IOU loss, performance exceeds the baseline, and the best accuracy of 83.42% is achieved at α = 5.0. Figure 8 plots the value of the IOU loss against training steps (the log of the loss is used for plotting). If α = 0.1, the loss can hardly converge to zero; conversely, if α is no less than 1.0, the loss converges to zero, and convergence is faster for larger α.

Classification Experiment
To evaluate our ST-BCNN model, we compare it with other state-of-the-art models, including classical CNN models (Vgg-16 [22], Inception V3 [18] and Inception Resnet V2) and models designed for fine-grained classification, such as BCNN and RAN (Residual Attention Network). We set α = 5.0 and conduct experiments on the 3 datasets. The results are in Table 3. Inception Resnet V2 outperforms the other classical CNN models on all 3 datasets, which confirms it is a suitable baseline. All the fine-grained models perform better than the classical CNN models; among them, the baseline + STN with IOU loss ranks second, and ST-BCNN is the best on all 3 datasets, achieving around 1% higher accuracy than the baseline + STN model. Figure 9 shows some examples of the transformations made by the STN module. With the IOU loss, the results avoid boundary loss: there is no empty area in the transformed images. The transformed images focus on key areas of the input image, like an attention mechanism; in Figure 9, objects are zoomed in by the STN module. Unlike common attention methods, however, the STN module also makes a translation modification (Figure 9(b)), which improves classification performance over common attention methods.
(a) Bird (b) UAV (c) Fish
Figure 9. Some examples of STN transformation (in each example, the left is the original image and the right are the transformed images).

Conclusion
In this paper, we propose a novel CNN model, ST-BCNN, for fine-grained classification with large spatial variance. ST-BCNN contains two functional CNN modules: a Spatial Transformer Network (STN) and a Bilinear CNN (BCNN). To handle boundary loss in the STN module, we design an IOU loss and analyze it in detail to show that it is reasonable and differentiable. A parameter α balances it with the cross-entropy loss; by comparing classification accuracy at different values, we find that the model performs best at α = 5.0. With the IOU loss, the STN module avoids boundary loss and makes appropriate spatial transformations. ST-BCNN outperforms several state-of-the-art methods on different datasets, with accuracies of 84.21%, 94.23% and 86.08% on CUB200-2011, Fish100 and UAV43 respectively. ST-BCNN combines the advantages of STN and BCNN, making it better than either module alone. We conclude that our model solves fine-grained classification with large spatial variance well.