Self-reconstruction network for fine-grained few-shot classification

Metric-based methods are among the most common approaches to few-shot image classification. However, traditional metric-based few-shot methods suffer from overfitting and local feature misalignment. The recently proposed feature reconstruction-based approach, which reconstructs query image features from the support features of a given class and uses the distance between the original and reconstructed query features as the classification criterion, effectively solves the feature misalignment problem. However, it does not address overfitting. To this end, we propose a self-reconstruction metric module for diversifying query features and a restrained cross-entropy loss for avoiding over-confident predictions. Together, they allow the proposed self-reconstruction network to effectively alleviate overfitting. Extensive experiments on five fine-grained benchmark datasets demonstrate the effectiveness of the proposed method.


Introduction
Deep learning has achieved impressive performance in computer vision.
However, deep networks usually demand large amounts of labeled data for training, which is impractical in many tasks where data acquisition is costly and time-consuming. For this reason, researchers in the computer vision community have turned considerable attention to few-shot learning in recent years, especially few-shot classification [1,2,3].
The goal of few-shot classification is to recognize unseen query samples with very limited (often fewer than 10) labeled support samples. Existing methods can usually be categorized into three classes: metric-based methods [4], optimization-based methods [5], and transfer learning-based methods [6,7]. Among these, metric-based methods are relatively simple but effective, achieving state-of-the-art performance in many few-shot tasks.
Metric-based methods usually adopt an episodic training strategy to train a feature extractor together with a fixed or parameterized distance metric, and then fix them, i.e., without fine-tuning, to classify unseen novel query samples. However, early works, e.g., the prototypical network [8] and the relation network [9], mainly build on global features, which suffer from inaccurate similarity measures between two samples due to the mismatch of key information in images. This is particularly detrimental to few-shot fine-grained image classification, as sub-categories have subtle differences and the valuable, discriminative information is likely to be located in different regions. To address this issue, some subsequent works focus on learning a metric on local features [10] or aligning local features [11,12].

Figure 1: Motivation of the self-reconstruction network. FRN [13] encounters the overfitting problem as its training loss keeps decreasing while its validation loss starts to increase after around 400 epochs. In contrast, the validation loss of our method stays stable after 400 epochs, demonstrating its effectiveness in addressing overfitting.
Recently, some metric-based approaches introducing new alignment [12] or reconstruction [13] techniques have achieved impressive performance in fine-grained few-shot image classification. However, in our experiments, we noticed that the state-of-the-art feature reconstruction network (FRN) [13] suffers from overfitting during episodic training. As shown in Figure 1, while the loss of FRN keeps decreasing on the training set, it increases on the validation set after 400 epochs. This overfitting may occur because, compared with ImageNet, CIFAR, etc., the numbers of images and classes in fine-grained datasets are relatively small, so the task diversity in episodic training is limited.

To summarize, the contributions of our work are three-fold: 1. We are the first to propose using self-reconstruction to increase the diversity of query features, which can effectively expand the representation capability of the learned query feature space and mitigate the overfitting problem.
2. We further propose a restrained cross-entropy loss, which can be easily equipped with existing metric-based few-shot classification models.
3. Experiments on five fine-grained datasets demonstrate the superiority of the proposed method, with detailed ablation studies showing that both self-reconstruction and restrained loss are effective in alleviating overfitting.

Related Work
In this section, we first provide a concise review of fine-grained few-shot learning methods. Next, we review two types of methods that are most relevant to this work, namely metric-based methods and alignment-based methods. For a more comprehensive review of fine-grained few-shot classification, we refer readers to [14]. MCL-Katz [16] aggregated local features into global features using weights set as the stationary distribution of local features. AGPF-FSFG [17] constructed multi-scale features and reweighted them via multi-level attention. Our method also makes use of local features to build the support set or query set, which are then used to generate support-reconstructed query features and self-reconstructed query features.

Fine-Grained Few-Shot Classification
In addition, our method shares the idea of data augmentation. Methods such as Hallucinator [18] and FOT [19] assume that variations in illumination, backgrounds, or poses can be shared across classes and thus can be utilized to diversify the limited support samples. In contrast, our work focuses on diversifying the query samples, and we employ the ridge regression technique for efficient self-reconstruction, eliminating the necessity to model intra-class variations.

Metric-Based Methods
Metric-based methods constitute one mainstream approach in few-shot learning. These methods learn a transferable feature embedding network such that queries can be classified according to the similarity between query features and support features. The similarity can be pre-defined, such as by using cosine similarity [20], or learned via a neural network [9]. A pioneering metric-based method is the prototypical network [8], which first constructs prototypes as the average of support features and then compares queries with these class representations. To adapt to fine-grained classification, LMPNet [21] used multiple prototypes per class and constructed prototypes as weighted averages of feature embeddings with learnable weights.
COMET [22] learned multiple embedding functions, one for each image segment or concept, and accordingly constructed multiple concept prototypes.

PHR [23] learned feature embeddings at local, global, and semantic levels and updated prototypes according to novel data. SAPENet [24] obtained more representative prototypes by emphasizing discriminative local features and channels using self-attention and the proposed intra-class attention, respectively.
Another direction of development in metric-based methods concerns the distance measure itself. To list a few, DeepBDC [25] proposed the Brownian distance covariance metric to exploit the joint distributions between support and query features. BSNet [26] combined cosine similarity and relation scores to learn more discriminative features in fine-grained images. Temperature Network [27], despite using a single similarity measure, gradually tuned the temperature scaling parameter in the measure, which acts similarly to enforcing a large-margin metric. Different from these methods, our method combines two Euclidean distances: one between the original query and the support-reconstructed query, and the other between the self-reconstructed query and the same support-reconstructed query.

Alignment-Based Methods
One issue with metric-based methods is that the position information of the embedded features of labeled samples may not correspond to that of unseen samples, so the distance calculated directly over these features can be very large, even for samples from the same category. To address this, alignment-based methods have been proposed [12,28,13]. LRPABN [28] trained a position transformation matrix to re-arrange the positions of support local features to match the query ones. DeepEMD [12] addressed the spatial inconsistency by adopting the earth mover's distance, which can be interpreted as the optimal matching cost of aligning two sets of local features extracted from a support and a query image. FRN [13] reconstructed query features from support features based on ridge regression, which avoids introducing many parameters as in the aforementioned methods and admits a closed-form solution. Building on FRN, LCCRN [29] improved the features by utilizing information from neighboring pixels. The resulting features were used to construct four cross-reconstruction tasks, whose reconstruction errors were combined using learnable weights. Besides spatial alignment, channel alignment has also been considered [30,31,32]. TDM [30] performed channel alignment, using attention on the support set to highlight class-wise discriminative channels and on a query instance to highlight object-relevant channels. SaberNet [31] adopted the Swin Transformer as the feature extractor to capture long-range spatial dependencies between local features, aligned query features, and refined prototype features at both the spatial and channel levels.
In this work, we perform feature reconstruction in a manner similar to FRN [13] by adopting ridge regression. However, our method self-reconstructs the query features from themselves, which diversifies the query features without introducing artifacts. Consequently, the representation capability of the learned query feature space is expanded, alleviating overfitting and leading to better generalization.

Self-Reconstruction Network

Each sub-dataset is used to construct a series of tasks, and each task contains a support set S = {(x_i, y_i)} of N × K labeled images and a query set Q, where x denotes the image and y denotes the class label. The support set is formed by first randomly selecting N classes and then randomly selecting K images for each of these N classes. The query set is formed by randomly selecting M images from the same N classes. Features and/or metrics are learned from the labeled support set and used to perform classification on the unlabeled query set.

Feature Reconstruction by Ridge Regression
Ridge regression is one of the most widely-used penalized regression methods for analyzing multivariate data with multicollinearity. FRN [13] adopted this technique to reconstruct features for few-shot image classification and achieved state-of-the-art results. For this reason, we also build our model on the strategy of ridge regression. Ridge regression shrinks the coefficient estimates by adding a penalty on squared coefficient values, i.e., by minimizing the following penalized residual sum of squares:

β̂_ridge = arg min_β ∥y − Xβ∥² + λ∥β∥²,   (1)

where y is the response vector, X is the design matrix, β is the coefficient vector, and λ is a parameter controlling the magnitude of the penalty.
In the case of feature-map reconstruction here, the response is a matrix.
Therefore, the coefficient should also be a matrix and, following the idea of ridge regression, the objective function is revised as follows:

Â = arg min_A ∥Y − XA∥²_F + λ∥A∥²_F,   (2)

where Y denotes the response matrix, A denotes the coefficient matrix, and ∥ • ∥_F denotes the Frobenius norm. The optimal solution is given by

Â = X^T (XX^T + λI)^(−1) Y.   (3)

The most expensive step in calculating Eq. 3 is the inverse operation, which is O(q³) for a q × p matrix X. When p < q, it is computationally more efficient to calculate the following equivalent solution, obtained by applying the Woodbury matrix identity [33]:

Â = (X^T X + λI)^(−1) X^T Y.   (4)

The computational cost of Eq. 4 is O(p³), which is smaller than that of Eq. 3 when p < q.
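As a quick sanity check of the equivalence between the O(q³) solution and the O(p³) Woodbury form, the following small NumPy sketch (sizes chosen arbitrarily for illustration) computes the coefficient matrix both ways and verifies that they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
q, p, n = 8, 3, 4            # X is q x p with p < q; Y is q x n
lam = 0.5                    # ridge penalty lambda

X = rng.standard_normal((q, p))
Y = rng.standard_normal((q, n))

# Eq. 3: invert the q x q matrix X X^T + lam*I  -> O(q^3)
A_eq3 = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(q)) @ Y
# Eq. 4 (via the Woodbury identity): invert the p x p matrix -> O(p^3)
A_eq4 = np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T @ Y

assert np.allclose(A_eq3, A_eq4)   # the two solutions coincide
```

With the feature-map sizes used later (p = Khw = 125 pooled support locations versus q = 640 channels in the transposed framing), the cheaper inverse is the p × p one, which is why Eq. 4 is preferred.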
Algorithm 1: Training procedure of the self-reconstruction network for N-way K-shot classification
Input: training data D_train, number of classes N, number of support images per class K, number of query images per class M, number of episodes t, optimizer (SGD).
Output: model parameter θ.
1: for episode = 1, . . . , t do
2:   Sample a task T from D_train;
3:   Split T into the support set S and the query set Q;
4:   Use f(•|θ) to extract class-specific features {S_c}, c = 1, . . . , N, and query features {Q_q}, q = 1, . . . , NM;
5:   Compute the class-specific reconstructed query features Q̂_c using Eq. 5;
6:   Compute the self-reconstructed query features Q̂_q using Eq. 6;
7:   Compute the squared Euclidean distance d_{c,1} between Q̂_c and Q_q using Eq. 9;
8:   Compute the squared Euclidean distance d_{c,2} between Q̂_c and Q̂_q using Eq. 10;
9:   Obtain the distance between the query and class c: d_{c,q} = d_{c,1} + d_{c,2};
10:  Obtain the final classification probability P using Eq. 12;
11:  Compute the model loss using Eq. 15;
12:  Update the model parameter θ to minimize the loss using the optimizer.
13: end for

Self-Reconstruction Metric Module (SRM)
In this paper, we propose a self-reconstruction metric module, which reconstructs query features both from support features and from the query features themselves, while enabling both kinds of reconstruction to be compared under the same metric. Reconstructing query features from support features achieves spatial alignment between query and support, while reconstructing query features from themselves augments the query features and increases task diversity.
In the N-way K-shot setting, the feature extraction module outputs feature maps S_c and Q_q, where S_c contains features from the K support samples of class c ∈ C and Q_q (q = 1, . . . , m) is the feature of a single query sample. Moreover, we pool all features from the same class, i.e., we apply a reshaping function R^(K×hw×l) → R^(Khw×l) that maps all features of class c into a single matrix S_c, where h, w, and l are the height, width, and number of channels of the feature map, respectively.
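The pooling step above is just a reshape. A minimal sketch, with hypothetical sizes h = w = 5 and l = 640 matching the ResNet-12 output described later, could look like:

```python
import numpy as np

K, h, w, l = 5, 5, 5, 640                  # hypothetical 5-shot sizes
per_image = np.random.rand(K, h * w, l)    # local features of class c, one block per support image

# Map all local features of the class into a single (K*h*w) x l matrix S_c
S_c = per_image.reshape(K * h * w, l)

assert S_c.shape == (125, 640)
```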
One core component of the SRM module is the reconstruction of query features, using both the support features and the query features themselves, as elaborated below.
The feature map Q_q can be reconstructed from the support features S_c according to Eq. 4, generating the class-specific reconstructed query features Q̂_c:

Q̂_c = ρ Q_q S_c^T (S_c S_c^T + λI)^(−1) S_c,   (5)

where ρ is a learnable re-scaling parameter. Eq. 5 addresses the misalignment between query and support features.
Applying the same technique, we can reconstruct Q_q from its own features, so as to map the query feature to a reconstruction space. Note that the feature map Q_q ∈ R^(hw×l) comes from a single query image, rather than all query images. The self-reconstructed query feature Q̂_q is obtained as follows:

Q̂_q = ρ Q_q Q_q^T (Q_q Q_q^T + λI)^(−1) Q_q.   (6)

The parameters ρ and λ in Eqs. 5 and 6 are not shared, and they are updated in the same way as the parameters of the feature extraction module during training. To ensure that ρ and λ are positive, they are parameterized as

ρ = e^α,   (7)
λ = e^β,   (8)

with α and β initialized to zero.

After obtaining the reconstructed features, we carry out distance calculation as in metric-based methods. In this step, the class-specific reconstructed query features Q̂_c and the self-reconstructed query features Q̂_q are used differently: Q̂_c serves as a class-specific prototype for spatially aligning support features with the original query features Q_q, while Q̂_q serves as an additional sample for expanding the representation capability of the learned query feature space. More specifically, we calculate the squared Euclidean distance between the support-reconstructed query features and the original query feature,

d_{c,1} = ∥Q_q − Q̂_c∥²_F,   (9)

the squared Euclidean distance between Q̂_c and the self-reconstructed query feature,

d_{c,2} = ∥Q̂_q − Q̂_c∥²_F,   (10)

and finally sum the two distances to get the distance between the query and class c:

d_{c,q} = d_{c,1} + d_{c,2}.   (11)
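Assuming the FRN-style ridge-regression reconstruction described above, a hedged NumPy sketch of the SRM distance computation for a single query and one class could look as follows. Sizes are illustrative (a smaller channel count is used for speed), and `rho` and `lam` stand in for the learned e^α and e^β:

```python
import numpy as np

rng = np.random.default_rng(1)
hw, Khw, l = 25, 125, 64                 # illustrative sizes
rho, lam = 1.0, 1.0                      # stand-ins for the learned e^alpha, e^beta

Q = rng.standard_normal((hw, l))         # query feature map Q_q
S = rng.standard_normal((Khw, l))        # pooled support features S_c of one class

def reconstruct(Y, B, rho, lam):
    """Ridge-regression reconstruction of Y from the rows of B (Eq. 5/6 style)."""
    G = B @ B.T                          # Gram matrix of the basis rows
    W = Y @ B.T @ np.linalg.inv(G + lam * np.eye(B.shape[0]))
    return rho * W @ B

Q_hat_c = reconstruct(Q, S, rho, lam)    # support-reconstructed query (Eq. 5)
Q_hat_q = reconstruct(Q, Q, rho, lam)    # self-reconstructed query (Eq. 6)

d1 = np.sum((Q - Q_hat_c) ** 2)          # squared Euclidean distance, Eq. 9
d2 = np.sum((Q_hat_q - Q_hat_c) ** 2)    # squared Euclidean distance, Eq. 10
d_cq = d1 + d2                           # distance to class c, Eq. 11

assert Q_hat_c.shape == Q.shape and d_cq >= 0.0
```

Note that the self-reconstruction reuses the same closed form with the query's own rows as the basis, which is what keeps the extra cost of SRM marginal.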

Loss Functions

By using the above distance, the final classification probability can be obtained as

P(c | Q_q) = exp(−τ d_{c,q}) / Σ_{c′} exp(−τ d_{c′,q}),   (12)

where τ is a learnable hyperparameter that controls the sharpness of the metric distance.
In the training phase, we use a joint loss integrating two loss functions to optimize the model. One is the widely-used cross-entropy (CE) loss:

L_CE = −(1/m) Σ_q y_q^T log p_q,   (13)

where y_q denotes the one-hot label vector, p_q denotes the vector of predicted probabilities, and m is the number of query samples.
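Assuming the standard softmax-over-negative-scaled-distances form implied by the text for Eq. 12, the classification probability and the CE loss can be sketched in NumPy as follows (random distances, m = 2 queries and N = 5 classes for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
m, N = 2, 5
tau = 0.1                                   # learnable sharpness parameter

d = rng.random((m, N))                      # d[q, c] plays the role of d_{c,q}
logits = -tau * d                           # smaller distance -> larger logit
P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # Eq. 12

labels = np.array([0, 3])                   # ground-truth classes of the m queries
L_CE = -np.mean(np.log(P[np.arange(m), labels]))                 # Eq. 13

assert np.allclose(P.sum(axis=1), 1.0)      # each row is a probability distribution
assert L_CE >= 0.0
```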
To further alleviate the overfitting problem, we propose to restrain the cross-entropy loss on the training classes. Specifically, we design a new restrained cross-entropy (RCE) loss L_RCE in Eq. 14, which has the opposite effect to the CE loss and can prevent the learned model from being over-confident in its predictions on the training set. The final loss combines the CE loss and the RCE loss:

Loss = L_CE + k L_RCE,   (15)

where 0 < k < 1 adjusts the influence of the RCE loss. When combined with the CE loss as in Eq. 15, the RCE loss can effectively restrict the model from producing over-confident predictions on the training classes.

Experiments

We evaluate our method on five fine-grained datasets: CUB-200-2011 (CUB), Stanford-Dogs (Dogs), Stanford-Cars (Cars), Flowers [38], and FGVC-Aircraft (Aircraft) [39]. The CUB setup is consistent with that in [13], and the images are pre-cropped using the provided bounding boxes. All datasets are divided into base, validation, and novel splits in a 2:1:1 ratio.
For the feature extractor, we adopt two widely-used backbones: ResNet-12 and ResNet-18.


Ablation Studies
To further demonstrate the effectiveness of the proposed network, we conduct 5-way 1-shot and 5-way 5-shot experiments with four model variants on the Flowers and Cars datasets with the ResNet-12 backbone. The four variants are: ProtoNet; ProtoNet with our proposed self-reconstruction metric module (SRM); ProtoNet with our proposed loss L_RCE; and ProtoNet with both SRM and L_RCE (i.e., Ours). As shown in Table 3, adding the SRM module to ProtoNet yields higher accuracy than ProtoNet in all cases. Using the loss L_RCE with ProtoNet gives slightly lower accuracy than ProtoNet on the 1-shot task on Flowers, while it improves accuracy in all other cases. Finally, when the SRM module and L_RCE are used together (Ours), the accuracy increases dramatically. This highlights the importance of combining both techniques to get a better network.

As can be seen from Figure 3, our method is clearly better than the other two methods in terms of the median (orange lines) and mean (green dashed lines). Moreover, the range of classification accuracy excluding outliers (i.e., the distance between whiskers) of our method is significantly narrower than that of ProtoNet and FRN, indicating that our method is more stable and has higher confidence. Furthermore, looking at the outliers (red points), the classification accuracy of our method is also better than the other two methods on the worst-performing tasks.

Classification Accuracy Under Different Shots
To further evaluate the stability of our method, we calculate the test accuracy of ProtoNet, FRN, and our proposed method under different K-shot settings on the CUB, Dogs, Flowers, and Cars datasets. In addition, the validation cross-entropy loss of FRN reaches its minimum at around the 420th epoch and gradually increases afterward, indicating the occurrence of overfitting. In contrast, the cross-entropy loss does not tend to increase after the 400th epoch for our method and its two variants. Therefore, both the proposed SRM structure and the restrained cross-entropy loss L_RCE can effectively mitigate the overfitting problem.

Visualization of Classification Probabilities
To further evaluate the effectiveness of the proposed RCE loss in preventing over-confident predictions, we present heat maps of the classification probabilities of query samples predicted by ProtoNet, FRN, and our method (defined by Eq. 12) on the test set of the Cars dataset. As shown in Figure 6, in each confusion matrix, the vertical axis shows the 5 classes in a task, and the horizontal axis shows the query samples of the 5 classes, with each class containing 16 query samples. The main diagonal indicates correct classification.
Warmer colors represent higher probability scores. For our method, the predicted probabilities on the correct classes are mostly restrained to around 0.7. This is a consequence of including the restrained cross-entropy in the loss function to prevent over-confident predictions, aligning with the findings drawn from Figure 5.

Visualization of Discriminative Feature Regions
To demonstrate the effectiveness of our proposed method for feature extraction, we visualize the feature regions extracted from the original images using Grad-CAM [42].
Figure 7 shows that the discriminative feature regions extracted by our method are more concentrated than those of ProtoNet and FRN. Our method also extracts discriminative features that ProtoNet and FRN do not capture well, so it could generalize better and be more robust to noise in some spatial regions.

Conclusion
In this paper, we proposed a self-reconstruction network for few-shot fine-grained image classification. Our contributions include enhancing feature diversity by self-reconstructing query samples and introducing a restrained cross-entropy loss to mitigate overfitting. Extensive experiments on five benchmark fine-grained datasets demonstrate the efficacy of our method, with state-of-the-art performance achieved on both 5-way 1-shot and 5-way 5-shot classification tasks.


The numbers of images and classes in fine-grained datasets are relatively small, and thus the task diversity in episodic training is limited. To alleviate overfitting, we propose a self-reconstruction network for few-shot fine-grained classification, which introduces a new self-reconstruction metric module and a restrained cross-entropy loss. The self-reconstruction metric module not only reconstructs query features from support features as in FRN but, more importantly, also reconstructs query features from themselves. Such self-reconstruction can effectively augment and diversify query features without introducing artifacts and, as shown in the experiments, it avoids over-reliance on a single discriminative feature. After feature reconstruction, both the distance between the original query features and their support-reconstructed features and the distance between the support-reconstructed and self-reconstructed query features are calculated, and their sum is used to match a query sample to a support class. The self-reconstruction network is trained with the classical cross-entropy loss and a new restrained cross-entropy loss; the latter prevents the model from producing over-confident predictions caused by overfitting the training data. By utilizing self-reconstruction and the restrained cross-entropy loss, the proposed method avoids overfitting, as shown in Figure 1.

Fine-grained few-shot classification faces the dual challenges of scarce labeled data and subtle distinctions between sub-categories, e.g., distinguishing between a beagle and a pug within the category of dogs. Global features, which capture image-level concepts, are insufficient to discriminate between fine-grained categories. Therefore, a line of research focuses on local features. DN4 [10], a notable method for fine-grained classification, first utilized intermediate CNN activations as local features and performed classification according to aggregated distances calculated over local features and their k-nearest neighbors. As an improvement on DN4, LSANet [15] allowed local patches of different scales to better capture structural information and assigned different weights to query patches to suppress the background and highlight the targets.

Problem Definition

The few-shot classification problem usually divides a given dataset into three sub-datasets according to different stages: D_train is used for model training, D_val is used for model evaluation during training, and D_test is used for the final test of the trained model. The three sub-datasets contain different image categories. N-way K-shot is a common setting for few-shot classification, meaning that a classification task consists of N classes with K labeled samples each. At different stages, each sub-dataset (D_train, D_val, and D_test) is used to construct a series of such tasks.
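A hedged sketch of how one N-way K-shot task (episode) could be sampled from such a sub-dataset. The label-to-image mapping and the sizes are illustrative, not the paper's actual data pipeline:

```python
import random

def sample_episode(dataset, N=5, K=5, M=16, seed=None):
    """Sample an N-way K-shot episode with M query images per class.

    `dataset` maps each class label to a list of image identifiers.
    Returns (support, query) lists of (image, label) pairs.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), N)        # randomly choose N classes
    support, query = [], []
    for c in classes:
        imgs = rng.sample(dataset[c], K + M)        # disjoint support/query images
        support += [(x, c) for x in imgs[:K]]
        query += [(x, c) for x in imgs[K:]]
    return support, query

# Toy sub-dataset: 10 classes with 40 images each
toy = {c: [f"img_{c}_{i}" for i in range(40)] for c in range(10)}
S, Q = sample_episode(toy, N=5, K=5, M=16, seed=0)
assert len(S) == 25 and len(Q) == 80
```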

Figure 2: The model architecture of the self-reconstruction network. Support and query images are mapped to the feature space using f(•|θ). Next, these features are sent to the proposed self-reconstruction metric module, which generates reconstructed features through the reconstruction module (RM) and calculates the distances. The network is trained with the proposed joint loss, which combines the cross-entropy loss (L_CE) and the restrained cross-entropy loss (L_RCE).

Architecture Overview

Our network structure, as shown in Figure 2, mainly consists of three modules. The first is the feature extraction module, which maps the original image to the embedding feature. The second is the self-reconstruction metric module (SRM), which feeds the support features and a query feature into the feature reconstruction modules (RMs) to obtain two types of reconstructed query features, one reconstructed from the support features and the other self-reconstructed from the query feature. Then the distance between the support-reconstructed query features and the original query feature, and the distance between the support-reconstructed and self-reconstructed query features, are calculated. The third is the loss module, which combines the cross-entropy loss and the restrained cross-entropy loss to train the network.

The two backbones are ResNet-12 and ResNet-18 [1, 13]. The ResNet-12 structure has 4 residual blocks, and each residual block contains 3 convolutional layers with 3 × 3 kernels. Each convolutional layer is followed by batch normalization; the first convolutional layer is followed by a ReLU nonlinearity, and a 2 × 2 max-pooling layer is added at the end of each residual block. The input dimension of the network is 3 × 84 × 84, and the output feature dimension after extraction is 640 × 5 × 5. Unlike the ResNet-18 network in [40], our ResNet-18 is modified from the ResNet-12 network. [40] uses a 7 × 7 convolution kernel in the first convolutional layer, which is not conducive to fine-grained feature extraction, as the subtle discriminative features may be located in a tiny region of a fine-grained image. Our ResNet-18, like ResNet-12, has 4 residual blocks, but the first two residual blocks are each divided into two sub-residual blocks; each sub-residual block contains 3 convolutional layers with 3 × 3 kernels. The rest of the structure is similar to ResNet-12. Throughout the experiments, we use 10-way 5-shot episodes for training, 5-way 5-shot for validation, and both 5-way 1-shot and 5-way 5-shot for testing. The query set contains 16 images in all three phases. In the training phase, we use the SGD optimizer on all datasets with an initial learning rate of 0.1 and momentum of 0.9, and we train for 1,200 epochs in total.

Figure 3: Boxplots of classification accuracy of ProtoNet, FRN, and our proposed method on the CUB, Dogs, Flowers, and Cars datasets. All experiments are based on a 5-way 5-shot classification setup with the ResNet-12 backbone. Each method is evaluated for 100 rounds, and the distributions of test accuracy are shown via boxplots. In each boxplot, the central orange line marks the median and the green dashed line marks the mean; the edges of the box are the 25th and 75th percentiles; and outliers are marked individually in red.

Figure 4: Test accuracy of ProtoNet, FRN, and our proposed method on the CUB, Dogs, Flowers, and Cars datasets under different K-shot settings. All experiments are based on a 5-way classification setup with the ResNet-12 backbone.

Figure 5: Cross-entropy loss of FRN, our method, and its two variants over training epochs.

Figure 6: Visualization of classification probabilities predicted by ProtoNet, FRN, and our proposed method on the test set of the Cars dataset. In each confusion matrix, the vertical axis shows the 5 classes in a task and the horizontal axis shows the query samples of the 5 classes, with each class containing 16 query samples. Warmer colors mean higher probability. The classification accuracy of each method is displayed below its respective figure.

Figure 7: Feature visualization under ProtoNet, FRN, and our method on the CUB (left) and Cars (right) datasets. Red indicates the learned discriminative features; the redder the region, the more discriminative the learned features.

Visualization of Reconstructed Features

To provide deeper insight into the reconstruction module, we train an image generator to recover images from query features. Specifically, three types of query features are considered: original features obtained directly after the feature extraction module, self-reconstructed query features, and support-reconstructed query features. To map features back to images, an inverted ResNet-12 decoder is trained on the 5-way 5-shot task to decode features of dimension 640 × 5 × 5 into 3 × 84 × 84 images. We use the Adam optimizer and the L1 loss to train the decoder. The initial learning rate is set to 0.01 and the batch size to 200. After training for 2,000 epochs, we select the parameters with the minimum loss for the decoder. In Figure 9, panel (a) shows 5 query images of the same class, and panel (e) shows 5 support images from 5 classes, where the images in the third column have the same class as the query. Panels (b)-(d) show images generated from query features after feature extraction, i.e., Q_q, from self-reconstructed query features, i.e., Q̂_q, and from support-reconstructed query features, i.e., Q̂_c, respectively. The (i, j)-th image in panel (d) is the recovered image for the i-th query using query features reconstructed from the j-th support class. As we can see from Figure 9, images recovered from self-reconstructed query features (panel (c)) are very similar to those recovered from the original query features (panel (b)), but there are some differences. For example, in the second image, the feathers are more blurred in the self-reconstructed case than in the original; in the fifth image, the beak is missing in the self-reconstructed case. This suggests that the self-reconstructed query features diversify the queries while preserving most of the class-relevant content.

Computational Cost

We evaluate the computational complexity of the proposed method. Table 4 lists the model complexity in terms of the number of parameters and the computational cost in terms of the training time per epoch. All methods were implemented in PyTorch on an NVIDIA RTX 3090 GPU. Compared with FRN, our method introduces an additional self-reconstruction step and distance calculation. The self-reconstruction step only adds two new parameters, namely ρ and λ in Eq. 6. Moreover, the increase in training time is marginal.

Table 1: Comparison of few-shot classification methods on the CUB-200-2011, Flowers, FGVC-Aircraft, Stanford-Cars, and Stanford-Dogs datasets. All experiments adopt ResNet-12 as the backbone network. Mean accuracy and the 95% confidence interval are reported. The best-performing methods are shown in bold and the second best are underlined.

Table 2: Comparison of few-shot classification methods on the CUB-200-2011, Flowers, FGVC-Aircraft, Stanford-Cars, and Stanford-Dogs datasets. All experiments adopt ResNet-18 as the backbone network. Mean accuracy and the 95% confidence interval are reported. The best-performing methods are shown in bold and the second best are underlined.

Table 3: Ablation studies on the Flowers and Cars datasets. All experiments adopt the ResNet-12 backbone. Mean accuracy and its 95% confidence interval are reported. ProtoNet with both SRM and L_RCE is our proposed network, denoted in the table as Ours.

Table 4: Comparison of model complexity and computational cost.