Article

Progressive Training Technique with Weak-Label Boosting for Fine-Grained Classification on Unbalanced Training Data

1 State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
2 Beijing Transportation Comprehensive Law Enforcement Corps Support Center, Beijing 100044, China
3 China Institute of Geo-Environmental Monitoring, Beijing 100081, China
4 The Affiliated High School of Peking University, Beijing 102218, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(11), 1684; https://doi.org/10.3390/electronics11111684
Submission received: 12 April 2022 / Revised: 20 May 2022 / Accepted: 23 May 2022 / Published: 25 May 2022
(This article belongs to the Section Computer Science & Engineering)

Abstract

In practical classification tasks, the sample distribution of the dataset is often unbalanced; for example, a dataset may contain a massive quantity of samples with weak labels for which concrete identification is unavailable. Even among samples with exact labels, the number of samples corresponding to many labels is small, making it difficult to learn these concepts from so few labeled examples. In addition, there is often a small interclass variance and a large intraclass variance among categories. Weak labels, few-shot problems, and fine-grained analysis are the key challenges affecting the performance of a classification model. In this paper, we develop a progressive training technique to address the few-shot challenge, along with a weak-label boosting method that considers all of the weak IDs as negative samples of every predefined ID in order to take full advantage of the more numerous weak-label data. We introduce an instance-aware hard ID mining strategy in the classification loss and further develop global and local feature-mapping losses to expand the decision margin. We entered the proposed method into a Kaggle competition whose aim is to build an algorithm to identify individual humpback whales in images. Together with a few other common training tricks, the proposed approach won first place in the competition. All three problems (weak labels, few-shot problems, and fine-grained analysis) exist in the competition dataset. Additionally, we applied our method to CUB-2011 and Cars-196, the most widely used datasets for fine-grained visual categorization tasks, and achieved respective accuracies of 90.1% and 94.9%. These experiments show that the proposed method outperforms common baselines and verify its effectiveness. Our solution has been made available as an open source project.

1. Introduction

The many factors that can affect the accuracy of classification tasks fall into two aspects: one is the problem of the data, and the other is the problem of the classification task itself. While in an ideal situation a model can learn the key features of each category and use them to make final classification decisions, in actual training tasks the sample distribution may be extremely uneven, as in the case of weak-label [1,2,3,4,5,6] and few-shot [7,8,9,10,11,12,13] problems. For the classification task itself, the goal is to find the decision boundaries between categories; however, the dataset may have a small interclass variance and a large intraclass variance. These issues are the key factors affecting classification accuracy.
For fine-grained visual categorization tasks, the gap between IDs belonging to the same category is small. As shown in Figure 1, instances belonging to the same ID may be different from each other while being similar to instances belonging to other IDs; that is, there is a small interclass variance and a large intraclass variance. During training, if two samples have similar content but different class labels, the cross-entropy loss function [14] forces the neural network to mine features with higher confidence in order to reduce the training loss. This situation is particularly severe in fine-grained classification tasks. Fine-grained categorization has been a popular research topic in the fields of computer vision and pattern recognition in recent years. Its purpose is to obtain more detailed subcategories from larger coarse-grained categories. Achieving a high classification accuracy for fine-grained images is more difficult than for coarse-grained images, as the differences between classes are more subtle. In general, different categories can be distinguished only by means of small local differences. The intraclass differences in fine-grained images are much larger than in object-level classification tasks such as face recognition [15,16], and there are many uncertain factors, such as pose, lighting, occlusion, and background interference. The framework of most fine-grained image classification algorithms is as follows: first, the foreground object and its local areas are found, and then features are extracted from these areas; after proper processing, the obtained features are used to train and apply the classifier. The signal-to-noise ratio of fine-grained images is small, and sufficiently discriminative information often exists only in small local areas. How to find and effectively use useful local area information has therefore become a key factor in developing fine-grained image classification algorithms. Several traditional methods rely on manual labeling, mainly bounding boxes and part locations, while other methods rely solely on the category label. Feature extraction is another key factor in determining image classification accuracy. Current mainstream deep learning technology requires a large amount of data to train a good model, and each class needs at least thousands of training samples to saturate convolutional neural network (CNN) performance on known categories. In addition, the generalization ability of neural networks is weak; it is difficult for a model to learn novel concepts from a small number of labeled samples when faced with novel classes.
In actual training tasks, the sample distribution may be extremely uneven, as in the dataset provided by the Kaggle competition: the distribution is extremely unbalanced, and the number of samples corresponding to many IDs is small (less than ten). A conventional approach is to train the model using all of the data with oversampling in order to balance the classes; however, the model then overfits the few-shot IDs due to the high sampling rate. There are several possible simple remedies based on straightforward considerations: if the training data are insufficient, the number of training samples should be increased, and in the case of overfitting, regularization techniques can be applied. Both of these techniques are used in our method.
Traditional supervised learning usually assumes that each instance is associated with a label. However, in many real-world tasks an instance may not belong to any known label. In the task presented in this article, in particular, the IDs of numerous samples are unavailable; we know only that these samples do not belong to any known ID. Traditional supervised learning based on one instance and one label cannot solve this problem. Therefore, how to deal with this label problem in order to improve the accuracy of the model is a key challenge of this paper. Weak-label learning has been studied for the past few years; [1] and [2] are two early studies in this direction. The WELL approach was presented in [1] based on the assumption that instance similarities are determined by a group of low-rank similarity matrices. The MLRGL approach was presented in [2], which uses the group lasso to regularize the training errors. Recently, many learning methods have tried to solve weak-label problems, including methods based on label co-occurrence, sparse reconstruction, and low-rank matrix completion. The weak-label problem occurs in other learning scenarios as well, such as multi-instance multilabel learning [3]. However, weak-label learning methods are not sufficient to handle semisupervised weak-label data well, as they neglect the exploitation of a large number of unlabeled instances that are known to be very useful.
In the present work, we propose a novel architecture including several useful modules to solve these two types of problems and improve the accuracy of classification tasks. To verify the ability of our method in practical tasks, we applied the methods to a Kaggle competition task. In the competition, participants are challenged to build an algorithm to identify individual whales in images from Happy Whale’s database, which contains over 25,000 images gathered from research institutions and public contributors. We then applied our method to CUB-2011 [17] and Cars-196 [18], which are the most widely-used datasets for fine-grained visual categorization tasks, in order to verify the effectiveness of our method. Aiming at the main difficulties discussed above, our work has made the following contributions.
  • We introduced an instance-aware hard ID mining strategy in the classification loss and added a feature-mapping loss to expand the decision margin. Specifically, we employed the batch hard triplet loss as the feature-mapping loss to encourage both a smaller intraclass distance and a larger interclass distance.
  • For the few-shot learning problem, we developed a progressive training technique. In the first stage, we trained the whole network using only the normal IDs which contain ten or more samples. In the second stage, we used all of the data, including normal IDs, few-shot IDs, and weak IDs, to train the model. Additionally, the backbone was partially fixed to prevent overfitting.
  • In order to take full advantage of weak IDs, we propose a weak-label boosting algorithm that considers the weak IDs as negative samples of every predefined ID, which is both intuitive and reasonable.
  • With common training techniques, our model won first place in the Kaggle challenge. On CUB-2011 [17] and Cars-196 [18], we achieved competitive results, with respective accuracies of 91.3% and 96.1%. This shows that the proposed method achieves the best effect among the compared approaches and verifies its effectiveness. Our solution has been made available as an open source project.
The remaining parts are organized as follows. First, related works are presented in Section 2. Next, we present the details of our method in Section 3, followed by Section 4, where the experimental settings are introduced. We discuss the results and contributions in Section 5. Finally, Section 6 concludes our paper and presents possibilities for future work.

2. Related Works

2.1. Fine-Grained Image Classification

Research on fine-grained image classification has experienced a long period of development since its inception. Earlier algorithms based on hand-crafted feature extraction often achieved limited classification performance because of their limited ability to express features. In recent years, with the rise of deep learning, the powerful feature extraction capabilities of neural networks have promoted rapid progress in this field. Strongly supervised methods use additional manual labeling information, such as bounding boxes and key points, to obtain the position and size of the target, which is conducive to improving the correlation between local and global areas and thus to improving classification accuracy. Weakly supervised methods use only image-level labels without additional labeling information. In [19], a recurrent attention CNN (RA-CNN) was proposed which can recursively learn discriminative regional attention and multiscale region-based feature representations in a mutually reinforcing manner. In order to handle the computational requirements of high-dimensional features, Ref. [20] represented the covariance features as a matrix and used a low-rank bilinear classifier. As a result, the classifier does not need to explicitly compute the bilinear feature map, meaning that the computation time and the number of effective learning parameters can be greatly reduced. In [21], a two-stream model was proposed that combines visual and language information to learn latent semantic representations. In [22], the problem was formulated as a sequential search for informative parts over the deep feature maps generated by a deep CNN, in which a set of candidate bounding boxes on the image is evaluated by an informative heuristic function and new candidate boxes are generated by a successor function. In [23], an end-to-end fine-grained visual categorization (FGVC) framework was proposed based on higher-order integration of hierarchical convolutional activations. Using convolutional activations as local descriptors, hierarchical convolutional activations can represent local regions of different scales. In [24], a new part learning method was proposed; through a multi-attention CNN (MA-CNN), part generation and feature learning can mutually reinforce each other. MA-CNN includes three subnetworks: convolution, channel grouping, and part localization. In [25], a variant of a generative adversarial network was proposed; this is a universal learning framework that combines a variational autoencoder and a generative adversarial network to synthesize images of fine-grained categories such as specific faces and objects. In [26], the authors proposed introducing confusion in the output logit activations to force the network to mine fewer discriminative features, thereby reducing overfitting on specific samples; specifically, the network is confused by minimizing the distance between the predicted probability distributions of random pairs of training samples. In [27], the authors proposed an additive angular margin loss (ArcFace) in order to obtain highly discriminative features for face recognition. The proposed ArcFace method has a clear geometrical interpretation due to its exact correspondence to the geodesic distance on the hypersphere.

2.2. Few-Shot Learning

Early few-shot learning algorithm research mostly focused on the image field. Few-shot learning methods can be roughly divided into three categories, namely, model-based [7,8,9], metric-based [10,11], and optimization-based methods [12,13].
The model-based method aims to quickly update the parameters on a small number of samples through the design of the model structure and to directly establish a mapping function between the input and the predicted value. In [7], a novel edge-labeling graph neural network was proposed which adapts a deep neural network on the edge-labeling graph for few-shot learning. In addition, [8] allows for the robust adaptation of a source-trained model to a target domain with few target samples by carefully designing the adaptation modules and imposing proper regularization. The fast generalization ability of the meta network [9] is derived from its “fast weights” mechanism: the gradients generated during the training process are used for fast weight generation. The model contains both a meta learner and a base learner.
The metric-based method measures the distance between the samples in the batch set and the samples in the support set and then performs nearest-neighbor classification based on these distances. Compared to a siamese (twin) network, a matching network [10] constructs different encoders for the support set and the batch set; the output of the final classifier is a weighted sum of the predicted values between the support set samples and the query. The prototype network [11] is based on the idea that a prototype representation exists for each category, where the prototype of a class is the mean of its support set in the embedding space.
The optimization-based method considers that ordinary gradient descent methods are difficult to apply in a few-shot scenario; thus, the optimization method itself is adjusted to complete the task of small-sample classification. In [12], the authors studied the reasons why gradient-based optimization algorithms fail with a small amount of data (these methods cannot be directly used for meta learning) and learned an update function or update rule for the model parameters; instead of training a single model over multiple rounds of episodes, a specific model is trained in each episode. The method proposed in [13] makes it possible to obtain better generalization performance with a smaller number of iteration steps on a smaller number of samples, and the model is easy to fine-tune. The core idea is to learn the initialization parameters of the model in order to maximize the accuracy on a new task after one or more gradient updates.

2.3. Weak Labels

Learning with noisy labels is mainly about training classification models in the presence of inaccurate class labels. In [4], the authors solved the problem of picking the correct label based on the labels provided by many labelers with different areas of expertise. As opposed to the above supervised learning methods with noisy labels, in [5] the authors proposed a weakly supervised semantic segmentation framework to deal with noisy labels. The weak-label problem occurs in other learning scenarios as well, such as multi-instance multilabel learning [3]. However, weak-label learning methods are not sufficient for semisupervised weak-label learning, as they neglect the exploitation of a large number of unlabeled instances that are known to be very useful.
Semisupervised multilabel learning falls into two categories. One category is transductive multilabel learning, which assumes testing instances are from unlabeled instances [28,29]. In [28], the authors assumed that the similarity in the label space is closely related to that in the feature space, and thus employed the similarity in the feature space to guide learning on missing label assignments, leading to a constrained non-negative matrix factorization optimization. In [29], the authors formulated a transductive multilabel learning method as an optimization problem in estimating label concept compositions. The other category is pure semisupervised multilabel learning, which can make multilabel predictions for any unseen instance [6]. In [6], the authors proposed an inductive co-training style method to address this problem. They generated two classification models: first, the feature space was dichotomized with diversity maximization to handle multilabel data; then, pairwise ranking predictions on unlabeled data were iteratively implemented for model refinement. Nevertheless, although semisupervised multilabel learning takes the incompleteness of relevant labels into account, it continues to assume that full relevant labels are available for labeled instances, and such an assumption does not hold in semisupervised weak-label learning.

3. Methods

In this section, we first introduce our model, then detail the key methods, and finally discuss the other tricks and training details we adopted.

3.1. Model

The goal of the Kaggle competition is to identify the predefined ID of a whale from a provided image of a humpback whale fluke. In order to achieve a high identification accuracy, it is necessary to take full advantage of few-shot IDs and weak IDs. For the former, we developed a progressive training technique. For the latter, we propose a weak-label boosting algorithm. As illustrated in Figure 2, our model consists of two key modules, an image feature extraction module and a feature description module.

3.1.1. Feature Extractor

The image feature extraction module uses SENet-154 [30], pretrained on ImageNet [31], as the backbone network. Given a whale image as input, the model first extracts a feature map with a fixed shape C×H×W. We tried to ensemble our SENet-154 model with other networks, such as SEResNeXt-101 [32,33] and DPN-131 [34]; however, we did not obtain any performance boost.
SENet achieved a top-5 error of 2.251% on the ImageNet test set, winning the classification task of the ImageNet 2017 competition. The squeeze-and-excitation (SE) module is mainly used to improve the sensitivity of the model to channel characteristics. This module is lightweight and can be applied to existing network structures; it needs only a small amount of additional computation to improve performance. We used SENet-154, which is based on the ResNet model. The core of feature extraction is the convolution operator, which learns a new feature map from the input feature map through a convolution kernel. In essence, a convolution operation is a feature fusion over a local area, fusing information both spatially and across channels. The innovation of the SENet model is its focus on the relationship between channels, with the aim of automatically learning the importance of different channel characteristics. To this end, the SENet model proposes the SE module.
The SE module first performs a squeeze operation on the feature map obtained via convolution in order to obtain channel-level global features, then performs an excitation operation on the global features to learn the relationship between each channel, obtains the weights of different channels, and finally multiplies the original feature map to obtain the final feature. In essence, the SE module performs attention or gating operations on the channel dimension. This attention mechanism allows the model to pay more attention to the channel features with the most information while suppressing those channel features that are not important. Another point is that the SE module is universal, which means that it can be embedded into existing network architectures.
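For concreteness, a minimal PyTorch sketch of an SE block in the style of [30] is given below; the reduction ratio of 16 follows the original SENet paper and is an assumption here rather than a reported setting of our model.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block (sketch following [30])."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # squeeze: C x H x W -> C x 1 x 1
        self.excite = nn.Sequential(                  # excitation: learn per-channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)                # channel-level global features
        w = self.excite(w).view(b, c, 1, 1)           # channel weights in (0, 1)
        return x * w                                  # reweight the original feature map
```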

3.1.2. Feature Descriptor

The feature extraction module processes the input whale image and obtains a 2048-dimensional image feature. The descriptor module processes the image feature and obtains two descriptors, namely, a global descriptor and a local descriptor.
Global Descriptor. The global descriptor is obtained by global average pooling (GAP) followed by an L2 normalization layer, yielding a 2048-dimensional representation (the global descriptor); the ID scores are then produced by an FC layer. The GAP layer, which averages each feature map over all of its spatial positions and outputs a single value, regularizes the entire network structure to prevent overfitting and has no learnable parameters. GAP reduces the number of parameters, thereby avoiding overfitting, and is more in line with the working structure of a CNN, since each feature map is associated directly with a category output rather than through intermediate units. The L2 normalization layer constrains the feature magnitude, which further helps to prevent overfitting.
Local Descriptor. We used an extra average pooling layer and an L2 normalization layer parallel to the global pooling layer to obtain a local descriptor of size 2048 × H × 1 for computing the local matching loss, which encourages both a smaller intraclass distance and a larger interclass distance. The feature-matching loss is discussed in detail later in this paper.
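A rough sketch of the two descriptor branches on top of the C × H × W backbone feature map is shown below; the exact pooling and layer configuration of our implementation may differ slightly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorHead(nn.Module):
    """Sketch of the global/local descriptor branches on a C x H x W backbone feature map."""
    def __init__(self, in_channels: int = 2048, num_ids: int = 5004):
        super().__init__()
        self.fc = nn.Linear(in_channels, num_ids)        # ID scores from the global descriptor

    def forward(self, feat: torch.Tensor):
        # Global descriptor: GAP over H x W, then L2 normalization.
        g = F.adaptive_avg_pool2d(feat, 1).flatten(1)    # B x 2048
        g = F.normalize(g, p=2, dim=1)
        scores = self.fc(g)
        # Local descriptor: pool only over the width, keep the H rows, then L2 normalize.
        l = F.adaptive_avg_pool2d(feat, (feat.size(2), 1))  # B x 2048 x H x 1
        l = F.normalize(l, p=2, dim=1)
        return scores, g, l
```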

3.2. Progressive Training

Few-shot samples make up the majority of the training dataset used in the Kaggle competition. A conventional method can be used to train the model using all of the data with oversampling for class balancing; however, the model then overfits the few-shot IDs due to the high sampling rate. In order to solve this problem, we developed a two-stage progressive training technique, as illustrated in Algorithm 1.
In the first stage, we trained the whole network (the backbone and FC) using only the normal IDs; the backbone network was pretrained on ImageNet. In the second stage, we used all the data to train the network. Additionally, the backbone was partially fixed in order to prevent overfitting; we fixed conv1, layer1 and layer2 of SENet-154. Additionally, we added a third stage using a small learning rate for better convergence.
In the first stage of this training process, we used only the normal samples to learn a discriminative representation; the few-shot samples were then handled on top of the pretrained feature mapping, which effectively alleviates the overfitting problem caused by the oversampling approach.
Algorithm 1: Progressive Training Strategy.
Step 1. Train the whole network (the backbone and FC) using only the normal IDs.
Step 2. Partially fix the backbone and use all the training data, including few-shot samples, to train the network for m epochs.
Step 3. Use a small learning rate in n epochs for better convergence.
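As a rough PyTorch sketch of this schedule (the learning rates and epoch counts follow Section 3.6; `model`, `criterion`, `normal_loader`, and `full_loader` are assumed placeholders, and the SENet-154-style layer names are illustrative):

```python
import torch

def freeze(modules):
    """Fix the given backbone modules by disabling their gradients."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad = False

def train_stage(model, loader, criterion, lr, epochs, device="cuda"):
    """One training stage: Adam over the currently trainable parameters only."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr, betas=(0.9, 0.99))
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            loss = criterion(model(images), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Step 1: whole network (backbone + FC), normal IDs only.
# train_stage(model, normal_loader, criterion, lr=3e-4, epochs=5)
# Step 2: partially fix the backbone, then use all data (normal, few-shot, and weak IDs).
# freeze([model.backbone.conv1, model.backbone.layer1, model.backbone.layer2])
# train_stage(model, full_loader, criterion, lr=3e-4, epochs=10)
# Step 3: a small learning rate for better convergence.
# train_stage(model, full_loader, criterion, lr=3e-5, epochs=5)
```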

3.3. Weak-Label Boosting

The dataset contains a massive number of samples with weak labels for which concrete identification is unavailable. Two trivial methods can be used to process these data.
(1) Discard them. In this way, the FC layer produces a 5004-dimensional (number of normal IDs) score. Then, the softmax function can be used to obtain a normalized confidence and compute the cross-entropy loss. However, this approach fails to utilize the numerous data with weak labels.
(2) View them as an ordinary class. The FC layer can provide a 5005-dimensional score where the last dimension stands for the score of the weak ID. Then, the loss can be computed similarly to (1). However, this strategy suffers from a severe probability unreliability problem. Experiments demonstrate that its performance is not as good as method (1).
Based on the observations above, we propose a weak-label boosting technique. When training the backbone and the FC layer, we consider the weak IDs as negative samples of every predefined normal ID, which is an intuitive and reasonable approach. While the number of positive samples of each predefined ID remains unchanged, the number of corresponding negative samples is greatly increased. In combination with a sigmoid-based loss, which is in effect a combination of multiple binary classifiers, each responsible only for the probability that an image does or does not belong to the current category, a larger interclass distance can be encouraged. The same procedure can be applied to the few-shot IDs.
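A minimal sketch of how such targets can be built for a sigmoid-based loss is shown below; encoding weak IDs with the label value −1 is an assumption of this sketch, not the dataset's actual encoding.

```python
import torch
import torch.nn.functional as F

NUM_IDS = 5004  # number of predefined normal IDs

def make_target(label: int) -> torch.Tensor:
    """Multi-label target: one-hot for a known ID, all zeros for a weak ID (label == -1)."""
    target = torch.zeros(NUM_IDS)
    if label >= 0:                      # known (normal or few-shot) ID
        target[label] = 1.0
    return target                       # weak ID: negative sample for every predefined ID

def weak_label_bce(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Sigmoid-based loss: each ID acts as an independent binary classifier."""
    return F.binary_cross_entropy_with_logits(scores, targets, reduction="mean")
```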

3.4. Loss Function

3.4.1. Instance-Aware Hard ID Mining

Although the softmax loss has been widely used in traditional classification tasks, in the context of this research the vanilla softmax loss is usually not adequate for a classification task with more than 5000 classes [15,16,27]. The softmax loss does not explicitly optimize the feature embedding; that is, it does not force intraclass samples to have higher similarity and interclass samples to have lower similarity, so the decision margins are not enlarged. In addition, the softmax loss has high computational requirements that increase dramatically with the number of classes.
Based on the above considerations, we introduced an instance-aware hard ID mining strategy. Given an instance, the FC layer produces a score vector with C (5004) dimensions. We then used a sigmoid function to normalize each score and computed the binary cross-entropy loss separately for each ID. Only the top K terms were allowed to backpropagate:
$$L_{cls} = -\sum_{c \in \mathcal{C}^{*}} \left[ y_c \log s_c + (1 - y_c) \log (1 - s_c) \right], \qquad \mathcal{C}^{*} = \underset{\mathcal{C}' \in P(\mathcal{C})}{\operatorname{arg\,max}}\, L_{cls}(\mathcal{C}'), \quad \text{s.t.}\; |\mathcal{C}'| = K$$
where $c \in \mathcal{C} = \{1, 2, \ldots, N\}$, $y_c \in \{0, 1\}$ indicates whether the ID of the instance is $c$, and $P(\mathcal{C})$ is the power set of $\mathcal{C}$. We set $K = 30$ for all the experiments.
There are three reasons to use sigmoid-based loss: (1) a more accurate threshold of new whales can be obtained by training with weak-label data; (2) a softmax-based loss without a margin always places new whales in the category region of a certain class; and (3) in this dataset, different whales have similar visual content. Therefore, minimizing the cross-entropy loss leads to interclass competition between similar classes, potentially forcing the network to learn sample-specific features. A sigmoid-based loss, on the other hand, makes categories more independent and reduces the interclass competition between them.
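The following is a sketch of the resulting loss, assuming the raw FC scores (logits) as input: per-ID binary cross-entropy terms are computed through a sigmoid, and only the K = 30 largest terms of each instance are backpropagated.

```python
import torch
import torch.nn.functional as F

def hard_id_mining_loss(scores: torch.Tensor, targets: torch.Tensor, k: int = 30) -> torch.Tensor:
    """Keep only the top-k per-ID BCE terms of each instance (inputs are B x C)."""
    per_class = F.binary_cross_entropy_with_logits(scores, targets, reduction="none")  # B x C
    topk, _ = per_class.topk(k, dim=1)      # hardest k IDs for each instance
    return topk.mean()
```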

3.4.2. Feature-Mapping Loss

In addition to the classification loss, we used the global and local feature-matching loss [35] to further improve performance. Specifically, to encourage both a smaller intraclass distance and a larger interclass distance, we employed the batch hard triplet loss [36,37]. The batch size was set to M × P, i.e., M IDs were sampled and P instances were selected for each ID. This loss can be expressed as
$$L_{FM} = \sum_{i=1}^{M} \sum_{a=1}^{P} \left[ m + \max_{p=1,\ldots,P} D\!\left( f\!\left(x_a^i\right), f\!\left(x_p^i\right) \right) - \min_{\substack{j=1,\ldots,M,\; j \neq i \\ n=1,\ldots,P}} D\!\left( f\!\left(x_a^i\right), f\!\left(x_n^j\right) \right) \right]_{+}$$
where $x_j^i$ is the $j$-th instance of the $i$-th ID, $[\cdot]_+$ is the rectified linear unit (ReLU) function $\max(\cdot, 0)$, and $m$ is the margin hyperparameter, which we set to 0.3 for all of the experiments.
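For reference, a compact batch-hard formulation in the spirit of [36,37] can be sketched as follows; the Euclidean distance is used for D here, which is an assumption of this sketch.

```python
import torch

def batch_hard_triplet_loss(embeddings: torch.Tensor, ids: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """Batch-hard triplet loss: hardest positive and hardest negative per anchor."""
    dist = torch.cdist(embeddings, embeddings, p=2)             # pairwise distances, B x B
    same_id = ids.unsqueeze(0) == ids.unsqueeze(1)              # positive mask (includes the anchor itself)
    hardest_pos = (dist * same_id.float()).max(dim=1).values    # farthest sample with the same ID
    large = dist.max().detach() + 1.0
    hardest_neg = (dist + same_id.float() * large).min(dim=1).values  # closest sample with a different ID
    return torch.relu(margin + hardest_pos - hardest_neg).mean()
```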

3.5. Other Tricks

Flipping Equivariance. Empirically speaking, the texture and outline of a tail, which are two crucial traits of a whale, are asymmetrical. Unlike general image classification tasks, a flipped image should be considered as a new ID. Using this task-specific data augmentation technique can significantly improve performance.
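A small sketch of this idea is given below: each horizontally flipped image is assigned its own ID rather than the label of the original; offsetting the label by the number of predefined IDs is purely an illustrative encoding.

```python
import torch

def add_flipped_ids(images: torch.Tensor, ids: torch.Tensor, num_ids: int):
    """Treat each horizontally flipped image as a new ID (illustrative encoding)."""
    flipped = torch.flip(images, dims=[-1])       # horizontal flip along the width axis
    flipped_ids = ids + num_ids                   # a flipped fluke becomes a distinct ID
    return torch.cat([images, flipped], dim=0), torch.cat([ids, flipped_ids], dim=0)
```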
Pseudo Labels. We obtained the classification confidence of all the test images using the initial trained model, and then added approximately 2000 test images with confidence greater than 0.96 to our training set. The final model was trained on the expanded training set with a smaller learning rate.
Four-Fold Cross-Validation with Class Balance. During our continuous improvements (from 0.8+ to 0.96), we found that the number of labels was correlated with the score. Thus, we used the following strategy to further balance our predictions: among the top five predictions (class 1, class 2, …), if the confidence gap satisfied $\mathrm{conf}_{\text{class1}} - \mathrm{conf}_{\text{class2}} < 0.3$, class 2 had not been used as a top-one prediction anywhere, and class 1 had already been used in the top-two predictions many times, we switched the positions of class 1 and class 2.
Data Augmentation. Random cropping, rotation, rescaling, shifting, and equivariant flipping tricks were employed to augment the training samples; we used Gaussian noise and speckle noise as well for heavier data augmentation.
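A rough torchvision sketch of such a pipeline follows; the crop size and parameter ranges are illustrative assumptions rather than the exact values used in our experiments, and flipping is excluded because flipped images are treated as new IDs.

```python
import torch
import torchvision.transforms as T

# Illustrative augmentation pipeline: cropping, rescaling, rotation, shifting, and noise.
train_transform = T.Compose([
    T.RandomResizedCrop(256, scale=(0.8, 1.0)),           # random cropping and rescaling
    T.RandomAffine(degrees=15, translate=(0.1, 0.1)),     # rotation and shifting
    T.ToTensor(),
    T.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),   # additive Gaussian noise
])
```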

3.6. Implementation Details

We trained the whole network in a progressive multistage manner. SENet-154 was used for local and global feature extraction. We used the Adam optimizer [38] to learn the CNN parameters, with β = (0.9, 0.99). We initialized the CNNs with pretrained models from ImageNet. The initial learning rate and batch size were set to 3e-4 and 32, respectively. In the first stage, we trained the model for five epochs using only the normal IDs (those containing ten or more samples). In the second stage, we fixed conv1, layer1, and layer2 of SENet-154, set the batch size to 64, and trained the model on the whole dataset for ten epochs. We added an extra (third) stage, in which the learning rate was set to 3e-5 for five epochs for better convergence. Our entire pipeline was built on the PyTorch framework [39]. We used eight Nvidia Titan X (Pascal) GPUs; the full training task required approximately eight hours.

4. Experimental Settings

In this section, we introduce the dataset used in the competition along with the publicly available datasets used for the other two categorization tasks, then discuss the evaluation methodology and details of the selected baselines.

4.1. Dataset

Kaggle competition. In the competition, participants are challenged to build an algorithm to identify individual whales in images from Happy Whale’s database. These training data contain thousands of images of humpback whale flukes with corresponding IDs. Individual whales have been identified by researchers and assigned an ID. The sample distribution is extremely unbalanced; thus, we split these IDs into three classes: (1) normal IDs, which contain ten or more samples; (2) few-shot IDs, in which the number of training samples is less than ten; (3) weak IDs, which means that the whales do not belong to the known IDs. The details of the training set are shown in Table 1. The test set contains 7960 images, and the ID set is the same as the training ID set except for the weak IDs. The challenge is to predict the whale ID of the images in the test set. What makes this task such a challenge is that there are only a few examples for each of the 3000+ whale IDs.
CUB-2011 and Cars-196. The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization tasks. It contains 11,788 images in 200 subcategories belonging to birds, of which 5994 are for training and 5794 for testing. The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. The data are divided into almost a 50–50 train/test split, with 8144 training images and 8041 testing images. The categories are typically Make, Model, and Year. The images are 360 × 240.

4.2. Evaluation Methodology

The results reported are top-1 accuracy for CUB-2011 and Cars-196. For the competition, the models are evaluated according to the mean average precision at 5 (MAP@5):
$$\mathrm{MAP}@5 = \frac{1}{U} \sum_{u=1}^{U} \sum_{k=1}^{\min(n,5)} P(k) \times \mathrm{rel}(k)$$
where U is the number of images; n is the number of predictions per image; the upper limit of k is set to min(n, 5), as specified by the competition (for each image in the test set, up to five labels for the whale ID may be predicted); P(k) is the precision at cutoff k, i.e., P(k) = true positives/(true positives + false positives); and rel(k) is an indicator function equal to 1 if the item at rank k is the relevant (correct) label and 0 otherwise. Whales that are not predicted to have a label identified in the training data should be labeled as new-whale. Once a correct label has been scored for an observation, that label is no longer considered relevant for that observation, and additional predictions of that label are skipped in the calculation. For example, if the correct label for an observation is A, any prediction list that places A in the first position scores an average precision of 1.0, regardless of the labels that follow.
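As a sketch, with a single relevant label per image the average precision reduces to the reciprocal rank of the first correct prediction:

```python
def ap_at_5(predictions, true_label):
    """Average precision at 5 for one image: 1/rank of the first correct prediction, else 0."""
    for k, label in enumerate(predictions[:5], start=1):
        if label == true_label:
            return 1.0 / k          # P(k) * rel(k); repeated hits of the label are skipped
    return 0.0

def map_at_5(all_predictions, all_labels):
    """Mean over all U test images."""
    return sum(ap_at_5(p, y) for p, y in zip(all_predictions, all_labels)) / len(all_labels)
```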
We can further probe the sparsity of the distribution of the ID centers, and thus the discriminativeness of the representation, by extracting the row vectors (which represent the centers of the corresponding IDs) from the last FC layer and calculating the pairwise distances between them. The distance metric used here is the cosine distance:
$$d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
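A minimal sketch of this probe, taking the weight matrix of the last FC layer as input, is as follows:

```python
import torch
import torch.nn.functional as F

def mean_pairwise_center_distance(fc_weight: torch.Tensor) -> torch.Tensor:
    """Average pairwise cosine distance between ID centers (rows of the last FC layer)."""
    w = F.normalize(fc_weight, p=2, dim=1)       # each row is one ID center
    cos_sim = w @ w.t()                          # pairwise cosine similarities
    n = w.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=w.device)
    return (1.0 - cos_sim[off_diag]).mean()      # d_cos = 1 - cosine similarity
```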

4.3. Baselines

For the competition, we mainly analyze the results of different variants of our method.
The baseline methods for Cars-196 are as follows: OPAM [40] introduced triplets of patches with geometric constraints to improve the accuracy of patch localization and automatically mine discriminative geometrically-constrained triplets for classification. However, the resulting approach requires object bounding boxes. In [41], the authors used competitive convolution-free transformers trained on ImageNet with a teacher–student strategy. The approach used in [42] learns fine-grained features by enhancing the semantics of sub-features of a global feature. In [43], the authors used two channel-specific components, namely, a discriminality component and a diversity component. The fine-tuned DenseNet models in [44] possessed no task-specific adaptations.
For CUB-2011, the baseline methods are as follows: the approach in [45] first used an object-extent learning module to localize the object according to the visual patterns shared among instances of the same category, and then a spatial context learning module to model the internal structure of the object. The approach used in [46] obtains knowledge from large-scale datasets and then fine-tunes for domain-specific fine-grained visual categorization. A weakly supervised data augmentation network was proposed in [47] to explore the potential of data augmentation, which is usually adopted to increase the amount of training data, prevent overfitting, and improve performance. In [48], the authors proposed a novel bidirectional attention-recognition model to actualize bidirectional reinforcement for the classification task. The approach used in [49] can predict the position of the object, and its attention part proposal module can propose informative part regions.

5. Results and Discussions

In this section, we focus on the results and comparisons of our method. For each of the modules, we performed detailed experiments in order to illustrate the impact of each part of the model on the final result. In addition, we performed an intuitive analysis of the results obtained by the model.

5.1. Quantitative Results and Analysis

Kaggle Competition. As seen in Table 2, when all modules are integrated, including the progressive training strategy and the various tricks introduced above, we finally reach a score of 0.973 on the test set, exceeding the second-place score of 0.972 and the third-place score of 0.971. This finding shows that our method can achieve a better recognition effect under fine-grained analysis, few-shot, and weak-label conditions. When using vanilla SENet-154, the score is 0.927 and the classification accuracy exceeds 0.9; this is a fairly high starting point, and the classification effect is already acceptable. On this basis, we take full advantage of the weak-label samples by adopting the weak-label boosting method: during training, we consider the weak IDs as negative samples of every predefined ID, which is intuitive and reasonable, and the score increases to 0.951. This is an improvement of approximately 2.5% over the previous result and the largest single improvement in accuracy. The weak-label boosting technique is therefore important, as it allows us to make full use of the weak IDs. According to the data in Table 2, the progressive training strategy, adding the feature-matching loss to the original classification loss, and four-fold cross-validation with class balancing each increase the recognition accuracy of the model by a similar amount, approximately 0.005, or approximately 0.52% better than the previous results.
Returning to the essence of the classification task, the data and the nature of the classification problem itself are the most important factors. Compared to traditional training methods, our two-stage progressive training technique focuses on the problem of few-shot data and avoids overfitting on the few-shot IDs caused by the high sampling rate. Weak-label boosting regards all the unlabeled data as negative samples of every predefined ID, which allows the network to fully learn the key features of the training data. These two methods focus on the training data, allowing the network to better learn the key features for the classification task. At the same time, the instance-aware hard ID mining strategy, the feature-mapping loss, and the hard triplet loss are used to encourage both a smaller intraclass distance and a larger interclass distance, which differs in principle from the data-focused methods. Pseudo labels, four-fold cross-validation with class balancing, and data augmentation are all based on the data itself and make full use of the training data. Finally, we found that the flipping equivariance method increased the accuracy by approximately 0.729%, an increase of 0.007 and the second largest improvement. Unlike general image classification tasks, a flipped image should be considered a new ID, which is a key factor that other competitors did not consider.
Table 3 lists the average pairwise distances between the centers. Note that the more discriminative the representation is, the larger the pairwise distances are. It can be seen that by introducing the global and local feature-mapping loss with an instance-aware hard ID mining strategy, our model yields a larger pairwise distance between the centers, indicating that the sample distribution is sparse.
CUB-2011 and Cars-196. Our accuracy is significantly higher than that of the baseline methods, as shown in Table 4. This clearly demonstrates our model's ability to discriminate subtle differences in order to recognize subordinate categories.

5.2. Experimental Results with Different Backbones

Figure 3 shows the change in accuracy when using different feature extractors as the backbone. The selected backbones are as follows: VGG16 [50] won first and second place in the localization and classification tasks, respectively, of the ImageNet Challenge 2014; the core idea of ResNet-50 and ResNet-101 is the use of identity shortcut connections; Inception-V4 combines inception modules with residual connections, which greatly improves training speed and performance. From the figure, we can see that under the premise of using the same tricks, the accuracy is generally highest when SENet-154 is used as the feature extractor. The reason is that SENet-154 focuses on the relationship between channels, so that the model can automatically learn the importance of different channel features; this finding demonstrates the strong ability of SENet-154. The figure also shows that when different feature extractors are used as the backbone and each of the methods mentioned in this article is then applied, the accuracy of the model improves, and the increasing trend in accuracy is consistent. This finding shows that our method is robust.

5.3. Experimental Results with Different Loss Functions

Different loss functions can make the model focus on learning different characteristics of the data and can better extract discriminative features through the gradients used to update the parameters. A suitable loss function makes the predicted value closer to the real value; when the predicted value equals the ground truth, the loss value is smallest. Figure 4 shows how the score changes when the same tricks are used with different losses. As seen from the figure, the classification loss combined with the global and local feature-matching loss achieved the best results. The reason is that our loss encourages both a smaller intraclass distance and a larger interclass distance. It is critical to set a reasonable margin value in the loss, as this is an important indicator for measuring similarity. In short, as the margin value decreases, the loss tends to approach 0, but it becomes difficult to distinguish similar images; the larger the margin value, the more difficult it is for the loss to approach 0, and the network may even fail to converge, but the network can then distinguish similar images with greater certainty. The margin used in our work is 0.3.
The softmax loss is widely used in traditional classification tasks. Its advantage is that it converges more easily with one-hot encoded targets than the hardmax loss. However, the softmax loss does not explicitly optimize the feature embedding; that is, while it encourages the separation of features of different classes, it does not encourage a large feature separation. Although the features obtained using the softmax loss divide the entire hyperspace or hypersphere according to the number of classes in order to ensure that the categories are separable, the softmax loss does not require compactness within categories or separation between them. Moreover, the vanilla softmax loss is usually not adequate for a classification task with more than 5000 classes, especially with small interclass variance and large intraclass variance. The core idea of the contrastive loss is to randomly select two samples from the training set [52]; if the two samples belong to the same class, their distance is made as small as possible, otherwise it is made as large as possible. Its shortcomings are obvious: a margin must be specified for each pair of nonhomogeneous samples, and this margin is fixed, which constrains the embedding space to a fixed scale. The center loss learns a center for each category and pulls all the feature vectors of each category toward the corresponding category center [53]. It is based on the softmax loss and is compact only within explicit constraints. However, the center loss needs to reserve a category center for each category; when the number of categories is large, the memory consumption is considerable.

5.4. Qualitative Results

In this part of the discussion, we provide visualization results in order to help better understand our model and the key methods mentioned in this article.
It is apparent that a severe class-imbalance problem is present. Thus, we need to apply transformations to observations in the smaller classes in order to create a more balanced dataset for fitting. We can significantly increase the size of the small classes by sequentially applying various transformations to the images. Of course, we must ensure that these transformations do not alter the visibility of the fluke in the image, as this would defeat the purpose of the transformations. Figure 5 shows the effects of the data enhancements. The data enhancement methods include random cropping, random affine transformation, random filtering, random noise, random perspective transformation, and red scaling. We implemented the affine transform as a combination of translation, rotation, and scaling; this transformation preserves parallel lines. The filtering step selects one filter at random from histogram equalization, contrast adjustment, Gaussian blurring, intensity rescaling, etc. The perspective (projective) transform, which is similar to looking at the object from a different viewpoint, does not preserve parallel lines. There is evidence of red scaling in the training and test datasets; thus, our model should be robust to this color scale. Transferring an existing picture to red scale is relatively straightforward.
Figure 6 shows the classification results and confidence of several test examples. We deliberately selected different sets of data with small interclass variances and large intraclass variances. It can be seen that the model provides correct classification results.

6. Conclusions

In this paper, we propose a progressive training technique with weak-label boosting to take full advantage of few-shot IDs and weak IDs. We introduce an instance-aware hard ID mining strategy while designing a new classification loss to expand the decision margin. With our model and with certain tricks discussed in this paper we were able to solve the Kaggle challenge, which is a difficult fine-grained analysis problem involving unbalanced training data. Our experimental results on the CUB-2011 and Cars-196 databases further verify the effectiveness of our method. Our solution has been made available as an open source project. In the future, we plan to extend our method to work across different modalities by leveraging transfer learning beyond domain similarity.

Author Contributions

Methodology, Z.W. and J.H.; validation, Y.J.; investigation, H.L. and S.Z.; writing—original draft preparation, Y.J.; writing—review and editing, Z.W. and H.L.; visualization, B.T. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by China Geological Survey (Grant No. DD20190637), Beijing Municipal Science and Technology Commission (Grant No. Z201100008120005) and National Key Research and Development Program (Grant No. 2021YFC30000502).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sun, Y.Y.; Zhang, Y.; Zhou, Z.H. Multi-label learning with weak label. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; pp. 593–598. [Google Scholar]
  2. Bucak, S.S.; Jin, R.; Jain, A.K. Multi-label learning with incomplete class assignments. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 2801–2808. [Google Scholar]
  3. Yang, S.J.; Jiang, Y.; Zhou, Z.H. Multi-instance multi-label learning with weak label. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013. [Google Scholar]
  4. Huang, S.J.; Chen, J.L.; Mu, X.; Zhou, Z.H. Cost-Effective Active Learning from Diverse Labelers. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 1879–1885. [Google Scholar]
  5. Huang, Z.; Wang, X.; Wang, J.; Liu, W.; Wang, J. Weakly-supervised semantic segmentation network with deep seeded region growing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7014–7023. [Google Scholar]
  6. Zhan, W.; Zhang, M.L. Inductive semi-supervised multi-label learning with co-training. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1305–1314. [Google Scholar]
  7. Kim, J.; Kim, T.; Kim, S.; Yoo, C.D. Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 11–20. [Google Scholar]
  8. Wang, T.; Zhang, X.; Yuan, L.; Feng, J. Few-shot adaptive faster r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7173–7182. [Google Scholar]
  9. Munkhdalai, T.; Yu, H. Meta networks. Proc. Mach. Learn. Res. 2017, 70, 2554. [Google Scholar] [PubMed]
  10. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3630–3638. [Google Scholar]
  11. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4077–4087. [Google Scholar]
  12. Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the ICLR 2017 Conference Track 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  13. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv 2017, arXiv:1703.03400. [Google Scholar]
  14. Zhang, Z.; Sabuncu, M.R. Generalized cross entropy loss for training deep neural networks with noisy labels. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 2–8 December 2018. [Google Scholar]
  15. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 212–220. [Google Scholar]
  16. Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; Liu, W. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5265–5274. [Google Scholar]
  17. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-Ucsd Birds-200-2011 Dataset; California Institute of Technology: Pasadena, CA, USA, 2011. [Google Scholar]
  18. Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, NSW, Australia, 2–8 December 2013; pp. 554–561. [Google Scholar]
  19. Fu, J.; Zheng, H.; Mei, T. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4438–4446. [Google Scholar]
  20. Kong, S.; Fowlkes, C. Low-rank bilinear pooling for fine-grained classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 365–374. [Google Scholar]
  21. He, X.; Peng, Y. Fine-grained image classification via combining vision and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5994–6002. [Google Scholar]
  22. Lam, M.; Mahasseni, B.; Todorovic, S. Fine-grained recognition as hsnet search for informative image parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2520–2529. [Google Scholar]
  23. Cai, S.; Zuo, W.; Zhang, L. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 511–520. [Google Scholar]
  24. Zheng, H.; Fu, J.; Mei, T.; Luo, J. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5209–5217. [Google Scholar]
  25. Bao, J.; Chen, D.; Wen, F.; Li, H.; Hua, G. CVAE-GAN: Fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2745–2754. [Google Scholar]
  26. Dubey, A.; Gupta, O.; Guo, P.; Raskar, R.; Farrell, R.; Naik, N. Pairwise confusion for fine-grained visual classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 70–86. [Google Scholar]
  27. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
  28. Liu, Y.; Jin, R.; Yang, L. Semi-supervised multi-label learning by constrained non-negative matrix factorization. AAAi 2006, 6, 421–426. [Google Scholar]
  29. Kong, X.; Ng, M.K.; Zhou, Z.H. Transductive multilabel learning via label set propagation. IEEE Trans. Knowl. Data Eng. 2011, 25, 704–719. [Google Scholar] [CrossRef] [Green Version]
  30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  31. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1106–1114. [Google Scholar]
  32. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  34. Chen, Y.; Li, J.; Xiao, H.; Jin, X.; Yan, S.; Feng, J. Dual path networks. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4467–4475. [Google Scholar]
  35. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  36. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  37. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
  38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  39. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the NIPS 2017 Autodiff Workshop: The Future of Gradient-Based Machine Learning Software and Techniques, Long Beach, CA, USA, 9 December 2017. [Google Scholar]
  40. Wang, Y.; Choi, J.; Morariu, V.; Davis, L.S. Mining discriminative triplets of patches for fine-grained classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1163–1172. [Google Scholar]
  41. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  42. Luo, W.; Zhang, H.; Li, J.; Wei, X.S. Learning semantically enhanced feature for fine-grained image classification. IEEE Signal Process. Lett. 2020, 27, 1545–1549. [Google Scholar] [CrossRef]
  43. Chang, D.; Ding, Y.; Xie, J.; Bhunia, A.K.; Li, X.; Ma, Z.; Wu, M.; Guo, J.; Song, Y.Z. The devil is in the channels: Mutual-channel loss for fine-grained image classification. IEEE Trans. Image Process. 2020, 29, 4683–4695. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  44. Valev, K.; Schumann, A.; Sommer, L.; Beyerer, J. A systematic evaluation of recent deep learning architectures for fine-grained vehicle classification. In Proceedings of the Pattern Recognition and Tracking XXIX. International Society for Optics and Photonics, Orlando, FL, USA, 18–19 April 2018; Volume 10649, p. 1064902. [Google Scholar]
  45. Zhou, M.; Bai, Y.; Zhang, W.; Zhao, T.; Mei, T. Look-into-object: Self-supervised structure modeling for object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11774–11783. [Google Scholar]
  46. Cui, Y.; Song, Y.; Sun, C.; Howard, A.; Belongie, S. Large scale fine-grained categorization and domain-specific transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4109–4118. [Google Scholar]
  47. Hu, T.; Qi, H.; Huang, Q.; Lu, Y. See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification. arXiv 2019, arXiv:1901.09891. [Google Scholar]
  48. Liu, C.; Xie, H.; Zha, Z.; Yu, L.; Chen, Z.; Zhang, Y. Bidirectional attention-recognition model for fine-grained object classification. IEEE Trans. Multimed. 2019, 22, 1785–1795. [Google Scholar] [CrossRef]
  49. Zhang, F.; Li, M.; Zhai, G.; Liu, Y. Multi-branch and multi-scale attention learning for fine-grained visual categorization. In Proceedings of the International Conference on Multimedia Modeling, Prague, Czech Republic, 22–24 June 2021; pp. 136–147. [Google Scholar]
  50. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  51. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv 2016, arXiv:1602.07261. [Google Scholar]
  52. Sun, Y.; Liang, D.; Wang, X.; Tang, X. DeepID3: Face recognition with very deep neural networks. arXiv 2015, arXiv:1502.00873. [Google Scholar]
  53. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 499–515. [Google Scholar]
Figure 1. Data examples: although the poses of the two whales in (a) are different, they belong to the same category. The two whales in (b) look similar but are different individuals. The same phenomenon exists for the vehicles in the Cars-196 dataset.
Figure 2. Model overview. Our model consists of two key modules: an image feature extraction module and a description module. The final predicted ID scores are obtained by passing the global descriptor through a bias-free FC layer and a sigmoid function.
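To make the scoring step in Figure 2 concrete, the following is a minimal PyTorch sketch of the described ID-scoring head: the global descriptor is passed through a bias-free FC layer and a sigmoid to produce one independent score per predefined ID. The descriptor dimension and the number of IDs are placeholders, not values taken from our released code.

```python
import torch
import torch.nn as nn

class IDScoringHead(nn.Module):
    """Bias-free FC layer followed by a sigmoid, as sketched in Figure 2."""
    def __init__(self, feat_dim: int = 2048, num_ids: int = 5004):
        super().__init__()
        # No bias: each row of the weight matrix acts as the center of one ID.
        self.fc = nn.Linear(feat_dim, num_ids, bias=False)

    def forward(self, global_descriptor: torch.Tensor) -> torch.Tensor:
        logits = self.fc(global_descriptor)
        return torch.sigmoid(logits)  # independent per-ID scores in [0, 1]

# Example usage with random descriptors (shapes are illustrative only).
head = IDScoringHead()
scores = head(torch.randn(4, 2048))  # -> (4, 5004) ID scores
```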
Figure 3. Experimental results with different backbones: VGG-16 [50], ResNet-50, ResNet-101, Inception-V4 [51], and SENet-154 were selected as backbones. Backbones that perform better on semantic segmentation also score higher on the classification task in this article, and SENet-154 achieved the best results. With the same features used under different backbones, the upward trend of the scores is consistent, and the contributions of weak-label boosting and progressive training remain high, which demonstrates the robustness of our method.
Figure 4. Experimental results with different loss functions. We compared four alternatives: the softmax loss commonly used in multiclass classification tasks, the hardmax loss, the contrastive loss [52], and the center loss [53]. The classification loss combined with the global and local feature-mapping loss achieved the best results.
Figure 5. For a given original image, we randomly applied several data augmentation methods. The execution order and the number of times each method was applied were random, so several different augmented results could be obtained. Here, we show one original image and five randomly augmented versions.
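As an illustration of the augmentation policy in Figure 5, the sketch below applies a random number of enhancement operations in a random order to each image. The specific operations and their parameters are placeholders chosen for this example; the figure does not fix them.

```python
import random
from PIL import Image
from torchvision import transforms

# Candidate augmentation operations (illustrative choices, not an exhaustive list).
OPERATIONS = [
    transforms.RandomHorizontalFlip(p=1.0),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(size=(256, 512), scale=(0.8, 1.0)),
]

def random_augment(img: Image.Image, max_ops: int = 3) -> Image.Image:
    # Random number of operations; an operation may be drawn more than once.
    ops = random.choices(OPERATIONS, k=random.randint(1, max_ops))
    random.shuffle(ops)  # random execution order
    for op in ops:
        img = op(img)
    return img
```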
Figure 6. We randomly selected test images from the competition dataset and the two other widely-used datasets; in each case, our method predicts the correct ID.
Table 1. Details of the training dataset provided by the competition; it can be seen that most of the samples belong to few-shot IDs and weak IDs.
ID Class        Number of IDs    Number of Samples
Normal IDs      273              4814
Few-shot IDs    4731             10,883
Weak IDs        1                9664
Total           5005             25,361
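The single weak ID in Table 1 accounts for more than a third of the training samples. The following is a minimal sketch, under our reading of the weak-label boosting idea and the per-ID sigmoid outputs described above, of how such samples can be used: a weak-ID image receives an all-zero target vector and therefore acts as a negative example for every predefined ID. The constant and helper names are illustrative.

```python
import torch
import torch.nn.functional as F

NUM_IDS = 5004  # predefined (non-weak) IDs; illustrative constant

def make_target(label_index):
    """Return the per-ID target vector; None marks a weak-ID (unidentified) sample."""
    target = torch.zeros(NUM_IDS)
    if label_index is not None:
        target[label_index] = 1.0  # positive only for its own ID
    return target                  # all zeros -> negative for every predefined ID

# One exactly-labeled image (ID 17) and one weak-ID image in the same batch.
scores = torch.sigmoid(torch.randn(2, NUM_IDS))
targets = torch.stack([make_target(17), make_target(None)])
loss = F.binary_cross_entropy(scores, targets)
```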
Table 2. Leaderboard scores. The private leaderboard is calculated on approximately 80% of the test data. Starting from a vanilla SENet-154 backbone, the training modules were integrated one by one; with all modules enabled, we ultimately reached a score of 0.973, exceeding all other teams.
Feature                       Leaderboard Score
Vanilla SENet-154             0.927
+ label boosting              0.951
+ progressive training        0.956
+ feature-mapping loss        0.960
+ flipping equivariance       0.967
+ 4-fold cross-validation     0.972
+ post-processing modules     0.973
Table 3. Average pairwise distances between the centers (row vectors of the last FC layer). This distance indicates how sparsely the centers are distributed: the more discriminative the representations, the larger the average pairwise distance.
Loss Function           Average Pairwise Distance
Hardmax loss            0.903326
Softmax loss            0.927687
Contrastive loss        0.991123
Center loss             0.993024
Feature-mapping loss    0.997201
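The Table 3 metric can be reproduced, up to the exact distance definition, with a few lines of PyTorch: treat each row of the last FC weight matrix as an ID center and average the pairwise distances between centers. Whether the rows are L2-normalized before measuring distance is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def average_pairwise_center_distance(fc_weight: torch.Tensor) -> float:
    """fc_weight: (num_ids, feat_dim) weight matrix of the last FC layer."""
    centers = F.normalize(fc_weight, dim=1)        # assumed L2-normalized centers
    dists = torch.cdist(centers, centers, p=2)     # all pairwise Euclidean distances
    off_diag = ~torch.eye(centers.size(0), dtype=torch.bool)
    return dists[off_diag].mean().item()           # exclude self-distances

# Illustrative call on a random weight matrix (100 IDs, 2048-d descriptors).
print(average_pairwise_center_distance(torch.randn(100, 2048)))
```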
Table 4. Accuracy (%) comparison with five recent state-of-the-art approaches on Cars-196 and CUB-2011. Our method achieves the highest accuracy on both datasets.
Cars-196                        CUB-2011
Method                   ACC    Method                   ACC
OPAM-2016 [40]           92.2   LIO-2020 [45]            88.0
DeiT-B-2020 [41]         93.3   DSTL-2018 [46]           89.3
SEF-2020 [42]            94.0   DAN-2019 [47]            89.4
MC Loss-2020 [43]        94.4   BARM-2019 [48]           89.5
Densenet161-2018 [44]    94.6   TBMSL-Net-2020 [49]      89.6
Ours                     94.9   Ours                     90.1
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
