Loss Architecture Search for Few-Shot Object Recognition

Department of Geomatics Engineering, School of Traffic & Transportation Engineering, Changsha University of Science & Technology, Changsha, China; Institute of Remote Sensing and Geographic Information System, Peking University, Beijing, China; School of Geosciences and Info-Physics, Central South University, Changsha, China; China Nonferrous Metal Changsha Survey and Design Institute, Changsha, China


Introduction
The object recognition problem under conventional supervised learning has been thoroughly studied with many successful models available, especially deep convolutional neural networks (DCNNs), including VGG [1], GoogLeNet [2], ResNet [3], and NASNet [4]. DCNNs can achieve impressive performance on large datasets like ImageNet [5] and Open Images [6]. However, in order to train a DCNN model, large amounts of labeled training data and training time are typically needed. Some works use pretrained DCNNs as off-the-shelf deep feature extractors [7]. While this procedure can reduce training time, it still requires lots of labeled images to fine-tune the models. This is in contrast with how a human vision recognition system works: a person does not need to see thousands of instances of an object, but only a small number of them, to remember and generalize for the recognition of the object later [8].
Motivated by the high capability of the human vision recognition system, one-/few-shot learning has attracted considerable attention from the computer vision community, including work on object recognition [9][10][11][12][13][14], object detection [15], image segmentation [16], image captioning [17], face identification [18], and person reidentification [19]. Different from standard supervised learning, few-shot learning aims to solve the problem where only a few labeled samples are available for training, while its extreme scenario, called one-shot learning, handles the more challenging situation where only one instance per class is available. Because a classifier trained with only one or a few instances per class rarely achieves the desired result, studies on one-/few-shot object recognition have focused on training it with a set of well-labeled data and then generalizing it to new classes [10,20].
For one-/few-shot object recognition, due to the scarcity of the labeled examples from each new class, a natural metric learning scheme is adopted in [21], which aims to learn object representation and categorize instances based on the absolute similarities between pairs of instances. This absolute similarity-based method, however, does not pay enough attention to the interclass and intraclass correlation and may therefore not reach a satisfactory accuracy.
Instead of absolute similarity, the scheme of relative similarity-based classification can have higher capability in decreasing the intraclass variations and increasing the interclass variations [22]. While it is difficult to separate or group objects only based on an absolute similarity measure, relative similarities can conveniently deal with these variations by requiring the intraclass similarity to be larger than the interclass similarity. Figure 1 shows an example of the advantage of relative similarity.
There have been many metric learning methods based on relative similarity [23][24][25][26]. However, different metric learning losses can lead to different performances and there has been no rule of thumb to design the best loss for a dataset.
Hence, in this paper, we develop a novel object representation-based metric learning method that induces suitable relative similarities with a particular loss generated by loss architecture search (LAS), inspired by neural architecture search (NAS) [4,27]. LAS, as shown in Figure 2, is a gradient-based method for finding good loss architectures. We use a recurrent neural network as the loss function architecture generator to generate a suitable loss for training the embedding network. The generator is updated with a reward signal, which measures how good the metric learning loss is based on the relative similarity. To obtain the reward signal, the following two steps are carried out. First, we train the embedding network with the generated loss L using the training set. Second, we use the few-shot instances from the validation set to fine-tune the embedding network to obtain the accuracy R. With this accuracy as the reward signal, we can compute the policy gradient (i.e., the gradient of P, the probability with which the generator produces L) by proximal policy optimization (PPO) [28] to update the loss generator. As a result, in the next iteration, the generator gives higher probabilities to those losses that drive the embedding network to obtain higher accuracies. In other words, the generator learns to improve its search over time. After obtaining the final loss function L* with the highest validation accuracy, the few-shot object recognition system is completed by updating the embedding network with the training set and the testing set (see the right part of Figure 2). The proposed approach trains the object embedding network on the well-labeled training set under a loss function generated by LAS, such that the instances from new classes can be categorized based on their similarities with the few-shot instances in the learned embedding space.
Our main contributions are threefold: (1) We create a search space for automatically determining the hyperparameters of metric learning losses, including their magnitudes, margins, and angles. (2) We present a few-shot recognition system where a proximal policy optimization algorithm is used to find the best loss architecture sampled from the search space such that the embedding network yields the highest validation accuracy. (3) Our few-shot object recognition system with LAS by reinforcement learning obtains the best results compared with recent state-of-the-art methods.

Few-Shot Learning.
Recent works for few-shot learning have adopted different strategies in different domains [8,9,11,18,21,29-33], among which the most relevant ones are the metric learning-based methods [8,21]. The Siamese network developed in [21] aims to learn absolute similarity between pairs of objects. It enlarges or lessens the similarity between a pair of samples according to whether they are from the same class. The authors of [8] propose a matching network which can also be considered as a metric learning-based method. This approach minimizes the cosine distance-based Siamese loss between the embeddings of instances with a bidirectional Long Short-Term Memory (bi-LSTM).
Another line of research uses metalearning for few-shot learning, which trains a metalearner from many relevant tasks. In [29], model agnostic metalearning is proposed to train a model that can quickly adapt to a new task with limited training data. The authors of [9] propose an LSTM-based metalearner model to train another learner in the few-shot regime. Another metalearning-based method is proposed in [30] which learns to fast parameterize an underlying neural network by employing metainformation. The memory-augmented neural network is adopted in [31] for few-shot learning. It enables each instance to be encoded and retrieved efficiently by a memory module. This module consists of key-value pairs where keys and values are the embeddings and labels of examples, respectively. Another work from [32] also uses a memory module for training, with a bi-LSTM to predict the parameters of the embedding model. The authors of [33] propose a task agnostic approach to improve the generalizability of few-shot metalearning. In addition to the research mentioned above, a few other works generate synthetic data for learning [11,18]. Different from these previous methods, we design our few-shot object recognition system in the reinforcement learning framework, which automatically determines the optimal loss function for training the embedding network.

Metric Learning.
Metric learning aims to learn a similarity function from examples, which measures the similarity between object pairs. Recently, quite a few methods have been proposed for metric learning. Some works [34][35][36] adopt pairwise constraints to train the models. As discussed in Section 1, these models' capacity is limited due to the use of absolute similarity. Most recent works [23,26,37] use DCNNs as embedding functions and use triplet-based constraints instead of pairwise constraints to capture similarities between objects. Their results show that the combination of a deep embedding model with a triplet-based constraint is effective in learning similarity functions.

Complexity
Nevertheless, the problems solved in these works still fall into the conventional supervised learning setting where a set of well-labeled data are needed. In our work, instead of manually designing a metric learning loss and training with a large amount of labeled data, we use LAS for metric learning, which can automatically determine the hyperparameters of relative similarity-based losses such as margin, magnitude, and angle.

AutoML.
The success of deep learning in computer vision tasks is largely due to its automation of the feature extraction process; hierarchical feature extractors are learned in an end-to-end fashion from data rather than being manually designed. To obtain better performance and handle more difficult problems, researchers have manually developed more and more complex networks. To automate the design of network architectures, neural architecture search [4,27] is thus a logical next step in machine learning. NAS can be seen as a subfield of AutoML. The work in [38] proposed a method to automatically determine the parameters of a softmax-based loss architecture by using AutoML. In our work, we propose LAS for finding the best architecture of the metric learning-based loss function for few-shot learning, which can be seen as another subfield of AutoML. We believe that combining NAS and LAS will take AutoML further, and we leave this as future work.

Loss Architecture Search
We formulate the problem of finding the best loss architecture as a discrete search problem. In our search space, a loss architecture consists of five subloss architectures, each of which has two hyperparameters: (1) the magnitude of the operation and (2) the margin or angle of the subloss.

Losses.
We choose five promising losses to construct the search space: Hard Mining Triplet Loss (HMTL) [23], Margin Sample Mining Loss (MSML) [24], Angular Loss (AL) [25], Triplet Center Loss (TCL) [26], and Quadruplet Center Loss (QCL). The first four are from the metric learning community and achieve state-of-the-art results in few-shot learning, face recognition, vehicle Re-ID, person Re-ID, and so forth. The last one, QCL, is proposed in this paper and is a complement to the first four losses.

Hard Mining Triplet Loss.
Figure 1: An example that shows the advantage of relative similarity. Left: nine objects from three different classes. The lemons and oranges are similar in appearance. Middle: a pairwise similarity-based model may be confused by a pair of objects from the lemon and orange classes. Right: if a model learns relative similarity between objects, it may not be confused since the anchor object (in the blue box) is more similar to the lemon than to the orange.
Figure 2: Left (loss architecture search with reinforcement learning): generate a loss architecture L with probability P using generator G; train the embedding network with the loss architecture L using the training set; fine-tune the embedding network using the few-shot instances sampled from the validation set to obtain the accuracy R; compute the gradient of P and scale it by R to update G. Right (few-shot object recognition): train the embedding network with the final best loss architecture L* using the training set; fine-tune the embedding network using the few-shot instances sampled from the testing set.
The triplet loss is based on a relative distance comparison among triplet instances. Specifically, for any triplet $(x_a^{(i)}, x_p^{(i)}, x_n^{(i)})$, where $x_a^{(i)}$ and $x_p^{(i)}$ are two distinct samples of the same class and $x_n^{(i)}$ comes from another class as a negative example, their embedding vectors $f_\phi(x_a^{(i)})$, $f_\phi(x_p^{(i)})$, and $f_\phi(x_n^{(i)})$ are generated by an embedding function $f_\phi$. The relative distance constraint of each triplet is defined as
$$D\big(f_\phi(x_a^{(i)}), f_\phi(x_p^{(i)})\big) + m < D\big(f_\phi(x_a^{(i)}), f_\phi(x_n^{(i)})\big), \tag{1}$$
where $m$ is a predefined constant parameter representing the minimum margin between the matched and mismatched pairs, $D(x, y)$ is a metric function that measures distance between vectors $x$ and $y$, and $i$ denotes the $i$-th triplet. In order to ensure fast convergence, it is crucial to select triplets that violate the triplet constraint in equation (1). This means that, given an anchor $x_a^{(i)}$, we want a hard positive $x_p^{(i)}$ that maximizes $D\big(f_\phi(x_a^{(i)}), f_\phi(x_p^{(i)})\big)$ and a hard negative $x_n^{(i)}$ that minimizes $D\big(f_\phi(x_a^{(i)}), f_\phi(x_n^{(i)})\big)$. For each sample (as an anchor) from a batch of size $M$, we select a hard-positive sample and a hard-negative sample (also from the batch) to construct a triplet. Thus, the HMTL function can be defined as
$$L_{\mathrm{HMTL}} = \frac{1}{M} \sum_{i=1}^{M} \Big[ \max_{p} D\big(f_\phi(x_a^{(i)}), f_\phi(x_p)\big) - \min_{n} D\big(f_\phi(x_a^{(i)}), f_\phi(x_n)\big) + m \Big]_+, \tag{2}$$
where $[z]_+ = \max(z, 0)$, the maximum runs over the positives of the $i$-th anchor in the batch, and the minimum runs over its negatives in the batch.
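The hard mining step can be illustrated with a small, dependency-free sketch. It assumes a Euclidean distance for D and averages over the anchors of a batch; the function names (`euclidean`, `hard_mining_triplet_loss`) and the toy batch are illustrative and do not come from the original implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed choice of D)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hard_mining_triplet_loss(embeddings, labels, margin):
    """Batch-hard triplet loss: for every anchor, use its farthest positive
    and its nearest negative inside the batch, then average the hinge terms."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [euclidean(anchor, e)
                     for j, (e, l) in enumerate(zip(embeddings, labels))
                     if l == label and j != i]
        negatives = [euclidean(anchor, e)
                     for e, l in zip(embeddings, labels) if l != label]
        if not positives or not negatives:
            continue  # this anchor cannot form a triplet in the batch
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch of 2-D "embeddings": two well-separated classes.
batch = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.0, 1.0)]
labels = [0, 0, 1, 1]
small_margin_loss = hard_mining_triplet_loss(batch, labels, margin=0.5)
large_margin_loss = hard_mining_triplet_loss(batch, labels, margin=1.5)
```

With the small margin every triplet already satisfies the constraint, so the loss vanishes; the larger margin makes three of the four anchors violate it.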

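Margin Sample Mining Loss.
The quadruplet loss (QL) [22] extends the triplet loss by adding a different negative pair. A quadruplet contains four different instances $(x_a^{(i)}, x_p^{(i)}, x_{n_1}^{(i)}, x_{n_2}^{(i)})$, where $x_a^{(i)}$ and $x_p^{(i)}$ are samples of the same class, while $x_{n_1}^{(i)}$ and $x_{n_2}^{(i)}$ are samples of two other classes. The quadruplet loss is formulated as
$$L_{\mathrm{QL}} = \frac{1}{N_{\mathrm{QL}}} \sum_{i=1}^{N_{\mathrm{QL}}} \Big( \big[ D\big(f_\phi(x_a^{(i)}), f_\phi(x_p^{(i)})\big) - D\big(f_\phi(x_a^{(i)}), f_\phi(x_{n_1}^{(i)})\big) + m_1 \big]_+ + \big[ D\big(f_\phi(x_a^{(i)}), f_\phi(x_p^{(i)})\big) - D\big(f_\phi(x_{n_1}^{(i)}), f_\phi(x_{n_2}^{(i)})\big) + m_2 \big]_+ \Big), \tag{3}$$
where $m_1$ and $m_2$ are two margins for the two terms, respectively, and $N_{\mathrm{QL}}$ stands for the number of quadruplets. The first term is the same as the triplet loss. The second term tries to enforce intraclass distances to be smaller than interclass distances [22]. Based on QL, a strategy named margin sample mining is proposed [24]. It picks the most dissimilar positive pair $(x_a, x_p)$ and the most similar negative pair $(x_{n_1}, x_{n_2})$ in the whole batch and formulates its loss as
$$L_{\mathrm{MSML}} = \Big[ \max_{a,p} D\big(f_\phi(x_a), f_\phi(x_p)\big) - \min_{n_1,n_2} D\big(f_\phi(x_{n_1}), f_\phi(x_{n_2})\big) + m \Big]_+. \tag{4}$$
As shown in equation (4), the constraints of MSML are extremely sparse. Only two pairs in a batch are used to compute the gradient of the embedding network in the training phase, which may seem to waste a lot of training data. In fact, the two chosen pairs are determined by all the data in one batch. During training, more and more pairs will be selected [24].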
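A minimal sketch of the margin sample mining step described above follows. It assumes a Euclidean distance and a toy batch; `margin_sample_mining_loss` is an illustrative name, not the authors' code.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed choice of D)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def margin_sample_mining_loss(embeddings, labels, margin):
    """Hinge on the single hardest positive pair (largest same-class distance)
    and the single hardest negative pair (smallest cross-class distance)."""
    pairs = list(combinations(range(len(labels)), 2))
    positive = [euclidean(embeddings[i], embeddings[j])
                for i, j in pairs if labels[i] == labels[j]]
    negative = [euclidean(embeddings[i], embeddings[j])
                for i, j in pairs if labels[i] != labels[j]]
    if not positive or not negative:
        return 0.0  # the batch contains no usable pair
    return max(0.0, max(positive) - min(negative) + margin)


batch = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.0, 1.0)]
labels = [0, 0, 1, 1]
msml_small = margin_sample_mining_loss(batch, labels, margin=0.5)
msml_large = margin_sample_mining_loss(batch, labels, margin=1.5)
```

Only two pairs contribute to the value, which illustrates the sparsity of the constraint noted in the text.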
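Angular Loss.
For a triplet $(x_a^{(i)}, x_p^{(i)}, x_n^{(i)})$, consider the triangle $\triangle x_a^{(i)} x_p^{(i)} x_n^{(i)}$ whose sides are denoted as $s_{ap}$, $s_{an}$, and $s_{pn}$. The original triplet constraint in equation (1) enforces that $s_{an}$ is longer than $s_{ap}$. Because the anchor and positive samples are of the same class, a symmetrical triplet constraint can be derived, which enforces that $s_{pn}$ is also longer than $s_{ap}$. According to the cosine formula, it can be proved that the angle $\angle x_a^{(i)} x_n^{(i)} x_p^{(i)}$ surrounded by the longer edges $s_{an}$ and $s_{pn}$ has to be the smallest one; that is, because the three angles of a triangle sum to 180°, $\angle x_a^{(i)} x_n^{(i)} x_p^{(i)}$ has to be less than 60°, which can be used to constrain the upper bound of $\angle x_a^{(i)} x_n^{(i)} x_p^{(i)}$ for each triplet triangle [25]:
$$\angle x_a^{(i)} x_n^{(i)} x_p^{(i)} \le \alpha, \tag{5}$$
where $\alpha$ is a predefined parameter. However, a straightforward implementation of equation (5) becomes unstable in some special cases [25]. In order to solve this problem, the angular loss function is defined, following [25], in the following equation:
$$L_{\mathrm{AL}} = \frac{1}{N_{\mathrm{AL}}} \sum_{i=1}^{N_{\mathrm{AL}}} \Big[ \big\| f_\phi(x_a^{(i)}) - f_\phi(x_p^{(i)}) \big\|^2 - 4 \tan^2\!\alpha \, \big\| f_\phi(x_n^{(i)}) - c^{(i)} \big\|^2 \Big]_+, \tag{6}$$
where $c^{(i)} = \big(f_\phi(x_a^{(i)}) + f_\phi(x_p^{(i)})\big)/2$ is the midpoint of the anchor and positive embeddings and $N_{\mathrm{AL}}$ is the number of triplets, which is also equal to the batch size $M$ in our implementation.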

Triplet Center Loss.
The goal of TCL is to leverage the advantages of the triplet loss and the center loss [39], that is, to effectively decrease the intraclass distances and increase the interclass distances simultaneously. Let a given training batch $\{(x_i, y_i)\}_{i=1}^{M}$ consist of $M$ samples $x_i$ with the associated labels $y_i$. In TCL, it is assumed that the features of each class share one center. Thus, the TCL is defined as
$$L_{\mathrm{TCL}} = \frac{1}{M} \sum_{i=1}^{M} \Big[ D\big(f_\phi(x_i), c_{y_i}\big) - D\big(f_\phi(x_i), c_n\big) + m \Big]_+, \tag{7}$$
where $c_{y_i}$ is the center of class $y_i$ and $c_n$ is the negative center nearest to $c_{y_i}$.
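The following sketch illustrates a triplet center loss on a toy batch. Several choices are assumptions made only for illustration: the distance is Euclidean, the class centers are fixed to the class means of the batch (in [26] they are learned), the negative center is taken as the one nearest to the sample, and the hinge terms are averaged over the batch.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed choice of D)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def class_means(embeddings, labels):
    """Illustrative centers: the per-class mean of the batch embeddings."""
    grouped = {}
    for e, l in zip(embeddings, labels):
        grouped.setdefault(l, []).append(e)
    return {l: tuple(sum(dim) / len(points) for dim in zip(*points))
            for l, points in grouped.items()}


def triplet_center_loss(embeddings, labels, centers, margin):
    """Average hinge between the distance to the own center and the
    distance to the nearest center of any other class."""
    total = 0.0
    for x, y in zip(embeddings, labels):
        d_own = euclidean(x, centers[y])
        d_neg = min(euclidean(x, c) for k, c in centers.items() if k != y)
        total += max(0.0, d_own - d_neg + margin)
    return total / len(embeddings)


batch = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.0, 1.0)]
labels = [0, 0, 1, 1]
centers = class_means(batch, labels)  # {0: (0.5, 0.0), 1: (3.0, 0.5)}
tcl_small = triplet_center_loss(batch, labels, centers, margin=1.0)
tcl_large = triplet_center_loss(batch, labels, centers, margin=2.5)
```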

Quadruplet Center Loss.
In equation (7), TCL aims to make the distance between the anchor and its corresponding center $c_{y_i}$ smaller than the distance between the anchor and its nearest negative center $c_n$. For example, in Figure 3(a), the anchor can be pushed to its center. However, in some cases such as the one in Figure 3(b), it cannot, even though it is still far away from its center. In order to solve this problem, we propose a quadruplet center loss, given as equation (8), where $(c_1, c_2)$ is a pair of centers. Thus, the proposed loss function guarantees that an anchor that is not close enough to its center will be moved closer, while an anchor that is already close enough to its center will be neglected.

Search Space.
Our key insight in constructing a search space and automatically finding an optimal loss from it is that (1) designing a loss architecture requires a lot of expert knowledge; (2) the available losses are all task- and dataset-dependent; and (3) there is no rule of thumb for choosing the hyperparameters of each loss. Therefore, we use the five loss architectures to construct the search space. Each loss comes with a magnitude. We discretize the range of the magnitudes into 11 uniformly spaced values from 0.0 to 1.0 with a step of 0.1 so that we can use a discrete search algorithm to find them. Similarly, we discretize the margin of each loss that has a margin into 51 uniformly spaced values from 0.0 to 5.0 with a step of 0.1. For the angular loss, we discretize the angle into 51 uniformly spaced values from 10.0° to 60.0° with a step of 1.0°. Finding an optimal loss architecture now becomes a search problem in a space of (11 × 51)^5 ≈ 5.6 × 10^13 possibilities. Notice that there is no explicit discard action in our search space; this action is implicit and can be achieved by setting the magnitude of a subloss architecture to 0.
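The size of the search space can be checked with a few lines of code. The sketch below enumerates the discretized choices as described in the text and decodes a list of ten action indices into one loss architecture; the data layout (a dictionary mapping each subloss to a pair of values) and the names `decode` and `second_choices` are illustrative assumptions.

```python
SUBLOSSES = ["HMTL", "MSML", "AL", "TCL", "QCL"]

magnitudes = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0
margins = [round(0.1 * i, 1) for i in range(51)]     # 0.0, 0.1, ..., 5.0
angles = [10.0 + i for i in range(51)]               # 10, 11, ..., 60 degrees


def second_choices(name):
    """The angular loss searches over an angle, every other subloss over a margin."""
    return angles if name == "AL" else margins


def decode(actions):
    """Map 10 action indices (2 per subloss) to {subloss: (magnitude, margin_or_angle)}."""
    arch = {}
    for k, name in enumerate(SUBLOSSES):
        magnitude = magnitudes[actions[2 * k]]
        second = second_choices(name)[actions[2 * k + 1]]
        arch[name] = (magnitude, second)
    return arch


space_size = 1
for name in SUBLOSSES:
    space_size *= len(magnitudes) * len(second_choices(name))

# A magnitude index of 0 implicitly discards a subloss (here MSML and QCL).
example = decode([10, 5, 0, 0, 5, 50, 3, 10, 0, 0])
```

Here `space_size` equals 561^5 = 55,566,661,698,801, which rounds to the 5.6 × 10^13 stated above.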

Search Algorithm.
The search algorithm we use to find the optimal loss architecture is a reinforcement learning algorithm (specifically, the proximal policy optimization (PPO)), inspired by [4,27]. The loss generator is a one-layer LSTM with 100 hidden units. As shown in Figure 4, at each step, the generator takes an action according to the maximum probability produced by a softmax; the output is then fed into the next step. In total, the generator has 10 softmax layers in order to predict 5 subloss architectures, each with 2 actions (selecting a magnitude and a margin/angle).
As shown in Figure 2, the generator is trained with a reward signal, which measures how good the loss architecture is at improving the accuracy of the object embedding network. After generating a loss architecture L with probability P, the embedding network is trained on the training set and fine-tuned using the few-shot instances. It is then evaluated on the validation set to measure the validation accuracy (the reward signal). At the end of the search, we obtain the best loss architecture L* with the highest accuracy on the validation set, which is then used for the final training of the embedding network.
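The reward-driven search loop can be sketched in a heavily simplified form. The code below is not the method of the paper: it replaces the LSTM generator with independent softmax distributions per decision, replaces PPO with a plain REINFORCE update with a moving-average baseline, samples actions instead of taking the maximum, and replaces the validation accuracy with a hypothetical toy reward (`toy_reward`). It only illustrates how scaling the log-probability gradient by the reward shifts probability toward better choices.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw one index from a categorical distribution."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1


def search(reward_fn, sizes, steps=1000, lr=0.2, seed=0):
    """REINFORCE with a moving-average baseline over independent decisions."""
    rng = random.Random(seed)
    logits = [[0.0] * n for n in sizes]
    baseline = 0.0
    best_actions, best_reward = None, float("-inf")
    for _ in range(steps):
        probs = [softmax(l) for l in logits]
        actions = [sample(p, rng) for p in probs]
        reward = reward_fn(actions)  # stands in for the validation accuracy R
        if reward > best_reward:
            best_actions, best_reward = actions, reward
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        for l, p, a in zip(logits, probs, actions):
            for j in range(len(l)):
                # Gradient of log softmax probability of action a w.r.t. logit j.
                grad = (1.0 if j == a else 0.0) - p[j]
                l[j] += lr * advantage * grad
    return best_actions, best_reward, logits


TARGET = [2, 1]  # hypothetical best choice for two decisions


def toy_reward(actions):
    """Toy reward: highest (1.0) at TARGET, decreasing with distance from it."""
    return 1.0 - 0.2 * sum(abs(a - t) for a, t in zip(actions, TARGET))


best_actions, best_reward, final_logits = search(toy_reward, sizes=[4, 3])
```

In the actual system each reward evaluation requires training and fine-tuning the embedding network, so the number of sampled architectures, rather than the policy update, dominates the cost.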

Experiments and Results
To evaluate the effectiveness of the proposed approach, we conduct experiments on three popular datasets for few-shot learning: CIFAR-100 [40], Omniglot [41], and mini-ImageNet [9]. The Omniglot dataset is the most popular few-shot object recognition benchmark with handwritten characters, and the mini-ImageNet dataset is a subset of ImageNet [5] released recently.

CIFAR-100 Dataset.
The CIFAR-100 dataset [40] contains 60,000 32 × 32 images with 100 classes. We use 60, 20, and 20 classes for training, validation, and testing, respectively. The validation set is used to generate validation accuracies as the reward signals for LAS.

Omniglot Dataset.
This dataset contains 32,460 images of handwritten characters, with 1,623 different characters within 50 alphabets. We follow the most common split in [8], splitting the dataset into a background set of 1,200 classes and a testing set of 423 classes. We further split the background set into a training set of 800 classes and a validation set of 400 classes.

mini-ImageNet Dataset.
The mini-ImageNet dataset is proposed by [9] as a benchmark with images of much higher resolutions and complexity. This dataset contains 100 classes randomly sampled from the ImageNet dataset, and each class contains 600 images. It is further split into a training set of 64 classes, a validation set of 16 classes, and a testing set of 20 classes [9].

Evaluation Setting.
Typically, a few-shot learning task with N new classes and k instances per class is referred to as an N-way k-shot task. In this paper, all experiments are for N-way k-shot object recognition, and the evaluation setting is similar to those in other compared methods. For each iteration in LAS (see the left part of Figure 2), we train the embedding network with the generated loss L using the training set. For evaluating L, we randomly select the k-shot instances as a support set consisting of N classes with k labeled images per class from the validation set for fine-tuning the embedding network. Then, we randomly select 15 images (disjoint with the support set) within the selected N classes from the validation set to evaluate the fine-tuned embedding network by 1-Nearest Neighbor (1NN). We repeat such evaluation procedures 5 times for each support set and use the mean accuracy as the final validation accuracy (i.e., the reward signal) corresponding to L.
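One evaluation episode can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the 15 query images are drawn per class, the fine-tuning step is omitted, and toy 2-D points stand in for the embeddings produced by the network. The helper names `sample_episode` and `one_nn_accuracy` are illustrative.

```python
import math
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Pick n_way classes, then k_shot support and n_query query items per class
    (support and query are disjoint because they come from one draw without replacement)."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for c in classes:
        items = rng.sample(data_by_class[c], k_shot + n_query)
        support += [(x, c) for x in items[:k_shot]]
        query += [(x, c) for x in items[k_shot:]]
    return support, query


def one_nn_accuracy(support, query):
    """Classify each query item by the label of its nearest support item."""
    correct = 0
    for x, y in query:
        _, predicted = min(support, key=lambda s: math.dist(x, s[0]))
        correct += predicted == y
    return correct / len(query)


# Toy "embeddings": class c is a tight cluster around (10 * c, 0).
rng = random.Random(0)
data = {c: [(10.0 * c + rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(20)]
        for c in range(8)}

support, query = sample_episode(data, n_way=5, k_shot=1, n_query=15, rng=rng)
accuracy = one_nn_accuracy(support, query)
```

Because the toy clusters are far apart relative to their spread, every query point is closest to the support point of its own class.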
After obtaining the final loss architecture L * for one dataset, we use the testing procedure to obtain the final accuracy (see the right part of Figure 2). We first train the embedding network with L * using the same training set. In the testing stage, we fine-tune the embedding network using a randomly selected support set from the testing set. After that, we randomly select 15 images (disjoint with the support set) within the selected N classes and then measure the classification accuracy by 1NN. To make the accuracy more convincing, we repeat such a testing procedure 600 times and report the final mean accuracy as well as the 95% Confidence Intervals (CIs).
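The reported statistic can be computed as in the sketch below. It assumes the common normal approximation (half-width of 1.96 standard errors) for the 95% confidence interval; the text does not state which formula was used, and the accuracies here are synthetic placeholders rather than results.

```python
import math
import statistics


def mean_and_ci95(accuracies):
    """Mean accuracy and the half-width of a normal-approximation 95% CI."""
    mean = statistics.fmean(accuracies)
    standard_error = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, 1.96 * standard_error


# Synthetic per-episode accuracies for 600 test episodes (placeholder values).
episode_accuracies = [0.5, 0.7] * 300
mean_accuracy, half_width = mean_and_ci95(episode_accuracies)
```

The result would be reported as the mean plus or minus the half-width, for example in percent.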

Network Architecture and Parameter Setting.
Our embedding network follows the same architecture as that used in [8,9], which has four convolutional layers (Conv-4). Each convolutional layer is designed with 3 × 3 convolution with 64 filters followed by batch normalization, ReLU nonlinearity, and 2 × 2 max pooling.
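As a quick sanity check on the embedding dimensionality, the sketch below computes the flattened feature size of such a Conv-4 network. It assumes that each 3 × 3 convolution preserves the spatial size (padding of 1) and that each 2 × 2 max pooling halves it with flooring; the padding is not stated in the text, so this is an assumption.

```python
def conv4_feature_size(height, width, channels=64, blocks=4):
    """Flattened output size after `blocks` conv + 2x2 max-pool stages,
    assuming size-preserving convolutions and floor division in pooling."""
    for _ in range(blocks):
        height //= 2
        width //= 2
    return channels * height * width


# 32 x 32 CIFAR-100 images: 32 -> 16 -> 8 -> 4 -> 2, giving 64 * 2 * 2 features.
cifar_features = conv4_feature_size(32, 32)
```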
On each dataset, the generator samples 5,000 loss architectures. We follow the training procedure and hyperparameters from [4] for training the generator. For training the embedding network, we use Adam [42] to perform stochastic optimization over the learning objective. The hyperparameters of Adam (i.e., β1, β2, and ε) are set to 0.9, 0.999, and 10^-8, respectively.

Ablation Study.
Since LAS integrates the five sublosses into the reinforcement learning framework, we conduct an ablation study on CIFAR-100 to understand the contribution of each subloss. In each case, we remove one subloss from the search space and perform the whole process of loss architecture search as shown in Figure 2. The results are presented in Table 1.
From Table 1, we can see that removing any one subloss results in an obvious performance drop compared with using all of them. This implies that the five losses are not redundant. In addition, removing QCL, the loss proposed in this paper, causes the most significant performance reduction on this dataset, which shows that QCL may contribute the most among the five losses.

Compared Methods.
To verify the performance of our LAS for few-shot learning, we compare it with the following 18 state-of-the-art methods: (1) Siamese Network (SN) [21], which presents a strategy for performing few-shot classification by learning a deep convolutional embedding network with a pairwise Siamese loss, (2) Matching Network (MN) [8], which performs few-shot learning by embedding a small labeled support set and unlabeled examples, (3) Matching Network with Full Contextual Embeddings (MN-FCE) [8], which is an upgraded version of MN by utilizing a bi-LSTM to contextually embed samples, (4) Model Agnostic Metalearning (MAML) [29], which learns hyperparameters through gradient descent in a metalearning fashion, (5) Metalearner LSTM (ML-LSTM) [9], which uses an LSTM as a metalearner to learn the parameter update rule for optimizing the network, (6) Siamese with Memory (SM) [31], which proposes a large-scale life-long memory module to remember past training samples and makes predictions based on stored previous samples, (7) Metanetwork (Meta-N) [30], which learns metalevel knowledge across tasks for rapid generalization, (8) Meta-Stochastic Gradient Descent (Meta-SGD) [43], which learns the initialization, update direction, and learning rates by metalearning, (9) Prototypical Networks (PN) [10], which learn a prototype representation for each class, (10) Relation Network (RN) [20], which computes relation scores between few-shot instances and testing instances, (11) Memory Matching Network (MM-Net) [32], which augments CNNs with memory and learns the network parameters for unlabeled images, (12) MM-Net⁻ [32], which is a variant of MM-Net by using a mixed training strategy, (13) Graph Neural Network (GNN) [44], which casts few-shot learning as a message passing task, (14) Task Agnostic Metalearning (TAML) [33], which learns an unbiased initial model with the largest uncertainty, (15) task-dependent adaptive metric (TADAM) [45], which proposes a practical end-to-end optimization procedure to learn a task-dependent metric space, (16) Simple Neural AttentIve Learner (SNAIL) [46], which proposes a generic metalearner architecture for few-shot learning, (17) Deep Nearest Neighbor Neural Network (DN4) [14], which replaces an image-level feature-based measure with a local feature-based image-to-class measure, and (18) deep learning with knowledge transfer architecture (KTN) [47], which jointly incorporates classifier learning, knowledge inferring, and visual feature learning into one framework. Note that, for fair comparison, the backbones of LAS and all these compared methods are of the same architecture as that used in [8,9], which is a shallow network with only 4 convolutional layers.

Results on CIFAR-100.
Table 2 compares LAS with 4 state-of-the-art methods (MN, SM, MAML, and Meta-N) and the 5 sublosses described in the last section. The code of these 4 methods is publicly available and thus can be used to run this experiment. The results across the 5-way 1-shot task and the 5-way 5-shot task show that our LAS method obtains the best performance among these methods. In particular, the 1-shot and 5-shot accuracies of our method reach 55.85% and 71.91%, respectively, on 5-way learning, an absolute improvement over MAML of 6.56% and 5.87%. Table 2 also shows the performances of 5 metric learning approaches for few-shot object recognition (HMTL [23], MSML [24], AL [25], TCL [26], and QCL). The results indicate that no single subloss dominates the loss searched by LAS.

Results on Omniglot.
Table 3 compares LAS with 8 state-of-the-art methods on the Omniglot dataset. The results of these methods are from [14,32,33]. The results across the 20-way 1-shot task and the 20-way 5-shot task indicate that our LAS method again achieves the best performance among the state-of-the-art techniques, including embedding models (SN, MN, SM, PN, and MM-Net) and metalearning approaches (Meta-N, MAML, and TAML). In particular, the 1-shot and 5-shot accuracies of LAS reach 97.69% and 99.21% on 20-way learning, respectively. This is an absolute improvement over MAML of 1.89% and 0.31%, which is significant on this dataset.

Results on mini-ImageNet.
The performance comparison with 14 state-of-the-art methods on mini-ImageNet is summarized in Table 4. The results of these methods are from [14,20,32,33]. Our LAS method also performs the best. In particular, the 1-shot and 5-shot accuracies reach 54.97% and 71.92% on 5-way learning, respectively, an absolute improvement over MM-Net of 1.60% and 4.95%. Compared with the most recent work DN4, LAS improves the accuracy of the two tasks by 3.73% and 0.90%, respectively. Our LAS gains 4.53% and 6.60% improvements over RN. RN computes relation scores between few-shot instances and the testing instances, while LAS finds the best metric learning-based loss to generate the best embedding functions.
To evaluate the proposed method on deeper networks, several experiments on mini-ImageNet are conducted by using ResNet-12 as the backbone. ResNet-12 is a 12-layer residual network. The performance comparison with ResNet-12 as the backbone on mini-ImageNet is summarized in Table 5.
The results across the 5-way 1-shot task and the 5-way 5-shot task show that the proposed method achieves the best performance among the state-of-the-art techniques. In particular, the 1-shot and 5-shot accuracies of LAS with ResNet-12 as the backbone reach 59.44% and 77.82% on 5-way learning, respectively. This is an absolute improvement over MAML of 9.08% and 10.48%, which is significant on this dataset.

Training and Inference Time.
Similar to other reinforcement learning algorithms, the training of LAS is time-consuming. On two Nvidia Tesla P100 GPUs, LAS takes about two days to find the best loss architecture on the mini-ImageNet dataset for the 5-way 5-shot task. However, after training, inference is fast; for example, one inference for the same task takes only 2 ms.

Conclusion
In this paper, we have proposed an automatic approach to loss architecture search for few-shot object recognition. Five metric learning-based sublosses (with one developed by us) are used to construct the search space. The loss generator is trained by a reinforcement learning algorithm. Our experiments show that the proposed few-shot object recognition method outperforms other state-of-the-art methods on three popular benchmarks. Future work includes combining network architecture search and loss architecture search for better AutoML.
Data Availability
The datasets and source codes used to support the findings of this study are available from the author upon request via e-mail (jyue@pku.edu.cn).