Toward Prototypical Part Interpretable Similarity Learning With ProtoMetric

The Prototypical Part Network (ProtoPNet) is an interpretable deep learning model that combines the predictive power of deep learning with the interpretability of case-based reasoning, thereby achieving high accuracy while keeping its reasoning process interpretable without any additional supervision. Thanks to these advantages, ProtoPNet has attracted significant attention and spawned many variants that improve both the accuracy and the computational efficiency. However, since ProtoPNet and its variants (ProtoPNets) adopt a training strategy specific to linear classifiers or decision trees, they run into difficulty when utilized for similarity learning, which is a practically useful technique for cases in which unknown classes exist. To solve this problem, we propose ProtoMetric, an extension of ProtoPNet that is applicable to similarity learning. Extensive experiments on multiple open datasets for fine-grained image classification demonstrate that ProtoMetric achieves accuracy comparable to that of state-of-the-art ProtoPNets with a smaller number of prototypes. We also demonstrate through case studies that ProtoMetric is applicable to image retrieval tasks where the class labels of the training and test sets are completely different.


I. INTRODUCTION
Deep learning has achieved high accuracy in a variety of computer vision tasks. However, since the reasoning process of deep learning models is black-boxed and cannot be interpreted by human operators, it is very difficult to validate their inference, and this impedes their utilization in high-risk domains. To alleviate this problem, researchers have dedicated significant effort to constructing inherently interpretable models, but such models generally suffer from degraded accuracy compared to their black-box counterparts. 'Gray-box' models [1], [2], [3] have thus emerged as a way of retaining the advantages of deep learning models while keeping the reasoning process interpretable. Among these, ProtoPNet with the 'this looks like that' framework [1] has attracted much interest because it can guarantee a transparent reasoning process without any additional supervision.
ProtoPNet first calculates the similarity of input samples to prototypes, each of which represents a certain image patch contained in the training set, and then classifies the samples on the basis of that similarity in a white-box manner. This process enables ProtoPNet to explain its reasoning process by providing the patches in the training set that the model considers similar to the input sample. Thus, interpretability with case-based reasoning is achieved. Thanks to this transparency, many variants of ProtoPNet, known as ProtoPNets, have been proposed [4], [5], [6], [7], [8]. However, despite their effectiveness, ProtoPNets run into difficulty when utilized for similarity learning because they require a pre-defined relationship between prototypes and class labels. To alleviate this problem, we proposed a novel cluster loss in our conference paper [9] and constructed a 'this looks like that' framework for similarity learning. However, three problems remain in our method: (1) only one loss function (i.e., margin loss [10]) can be utilized, which means it cannot benefit from the progress in deep metric learning, (2) its suitability in situations where similarity learning is effective (e.g., unknown classes exist) has not been sufficiently examined, and (3) the computational cost for inference is high due to the KNN classification.
In this paper, we therefore propose ProtoMetric, an improved version of our conference method. We also demonstrate that ProtoMetric can be applied to image retrieval tasks by utilizing the knowledge distillation method [11]. The key contributions of this paper are as follows:
• We propose ProtoMetric, an improved version of our conference method that enables us to utilize a wide range of loss functions and take full advantage of the progress in deep metric learning. In addition, the inference cost of ProtoMetric is reduced to a level similar to that of other ProtoPNets by utilizing the class center.
• Through case studies, we confirm that ProtoMetric can be utilized in image retrieval tasks. To the best of our knowledge, this is the first work to construct a 'this looks like that' framework in image retrieval tasks.
• Extensive experiments on multiple open datasets for fine-grained image classification demonstrate that ProtoMetric can achieve comparable results to other ProtoPNets with a smaller number of prototypes.

II. RELATED WORKS
Research on interpretability in deep learning models can be divided into two approaches: post-hoc and ante-hoc.
In the following, we briefly describe each approach along with the studies on deep metric learning that are closely related to the current work.

A. POST-HOC APPROACH
The post-hoc approach [12], [13], [14] aims at interpreting the inference of already trained black-box models and usually indicates the image regions that contribute the most to model inference. As such, the post-hoc approach can provide an intuitive, easy-to-understand explanation. In addition, it can be utilized without re-training the already deployed black-box models. Unfortunately, the explanations provided by the post-hoc approach sometimes have nothing to do with the model inference [15]. To alleviate this fidelity problem, some researchers have utilized the Shapley value [16] and proposed methods with a theoretical guarantee [17], [18]. However, since the calculation of Shapley values is NP-hard, it is difficult to apply these methods to high-dimensional data such as raw images. Furthermore, the explanations provided by post-hoc approaches are completely different from those provided by ProtoPNets, which makes it difficult to compare the two. How to evaluate explanatory methods for deep learning model inferences in a unified manner is an open question, and this is why we limit our counterparts to ProtoPNets in this paper.

B. ANTE-HOC APPROACH
The ante-hoc approach aims at constructing inherently interpretable models, and a wide variety of such models has been proposed [1], [2], [3], [19], [20], [21], [22]. Since the explanations provided by ante-hoc approaches are closely related to the model inference, the fidelity of explanation is guaranteed. Among the methods based on the ante-hoc approach, the 'this looks like that' framework with ProtoPNet [1] has attracted significant attention because it can implement a transparent reasoning process without any additional supervision. In the following, we describe the details of several ProtoPNets that are closely related to our method.

Following Chen et al.'s [1] original proposal of the 'this looks like that' framework with ProtoPNet, many variants have been developed to improve both the accuracy and the computational efficiency [4], [5], [6], [7], [8]. In addition to these variants, model debugging by re-training prototypes [23] and the application to tasks other than image classification, such as semantic segmentation [24], reinforcement learning [25], and graph classification [26], have also been proposed. However, few researchers have attempted to apply the 'this looks like that' framework to similarity learning, even though similarity learning has important practical applications such as image retrieval and person re-identification. Conventional ProtoPNets run into difficulty with similarity learning because their training requires a pre-defined relationship between prototypes and class labels, and thus is specific to linear classifiers. As exceptions, ProtoTree [4] and ProtoPool [8] do not require this pre-defined relationship. ProtoTree, which utilizes a decision tree, has achieved high accuracy while reducing the number of prototypes thanks to sharing prototypes between different class labels. However, since its training process is specific to decision trees, it is difficult to apply it to similarity learning. Alternatively, ProtoPool assigns prototypes to 'slots' belonging to a specific class during the training rather than directly pre-defining the relationship between prototypes and class labels. This is quite successful and enables high accuracy with a small number of prototypes. However, since the 'slots' are specific to a certain class, it is difficult to apply ProtoPool to classifiers other than the linear classifier.

To alleviate these problems, we extended ProtoPNet to similarity learning in our earlier work [9], and in this paper, we present ProtoMetric, an improved version of our earlier method that can be applied to image retrieval tasks where class labels in the training set are completely different from those in the test set. To the best of our knowledge, this is the first work to construct a 'this looks like that' framework in image retrieval tasks.

C. OVERVIEW OF DEEP METRIC LEARNING
In deep metric learning, most methods take a ranking-based approach [10], [27], [28], in which pairs of samples are constructed in the mini-batch and the distance between them is optimized. However, the ranking-based approach has a problem in that the training process is complicated since the number of sample combinations increases exponentially. Proxy-based methods [29], [30], [31], [32] have been proposed to alleviate this problem, and have achieved high accuracy. In our experiments (discussed later), we chose one typical loss function from each group, namely, margin loss [10] and proxy-anchor loss [31], because the learned feature representation might be different when using the ranking-based losses and when using the proxy-based losses [33]. Deep metric learning methods usually measure the similarity between images with cosine similarity, and so this is the procedure we followed in this paper.

III. METHOD
Unless otherwise specified, we utilize the same mathematical notation as [34]. In the following, we first describe the model architecture of ProtoMetric (Sec. III-A), then explain its training process (Sec. III-B), and finally discuss how to interpret its inference (Sec. III-C).

A. MODEL ARCHITECTURE
As shown in Fig. 1, the model architecture of ProtoMetric is composed of a convolutional layer, attention layer, prototype layer, and fully connected layer. In the following, we explain the details of the model inference process and the multi-head trick utilized in our model. The model inference process is also summarized in Alg. 1.

1) MODEL INFERENCE PROCESS
Input image $x_a$ is first transformed into feature map $Z^B_a$ by the convolutional layer, where the subscript $a$ indicates the data index. We use $a, b, \ldots$ to denote data indices unless otherwise specified. Then, the attention layer extracts the foreground region on the basis of $Z^B_a$. To do so, it calculates the ranking activation mapping (RAM) [35] and obtains the binary mask $\check{A}_{a,hw}$, which takes 1 at the high-attention positions that together account for the $t_{\mathrm{ram}}$ percentile of the total attention mass and 0 otherwise. In the following, we refer to $\check{A}_{a,hw}$ as the attention mask.
Next, the prototype layer applies a 1×1 convolutional layer to $Z^B_a$ and obtains $Z^P_a$. Then, the cosine similarity between prototypes $\{p_i\}_{i=0,1,\ldots}$ and the feature vector contained in each pixel of $Z^P_a$ is calculated to obtain the prototype heatmap $H_a$. That is, the component of $H_a$ corresponding to the prototype $p_i$ at the position $(h, w)$ is defined as

$H_{a,i,hw} = \hat{z}^{P\top}_{a,hw}\, \hat{p}_i,$

where $z^P_{a,hw}$ is the feature vector contained in $Z^P_a$ at the position $(h, w)$. The subscript $i$ indicates the index of the prototype, and we use $i, j, \ldots$ to denote the prototype indices in the following. We also denote the L2-normalized $z^P_{a,hw}$ and $p_i$ as $\hat{z}^P_{a,hw}$ and $\hat{p}_i$, respectively. In the following, the L2-normalized vector of $v$ is denoted as $\hat{v}$ unless otherwise specified. We also refer to $z^P_{a,hw}$ as the image patch feature. The prototype layer then outputs the maximum similarity value in the foreground (where $\check{A}_{a,hw}$ takes 1) as the similarity of the input image $x_a$ to each of the prototypes, which is formulated as

$s_{a,i} = \max_{(h,w):\, \check{A}_{a,hw}=1} H_{a,i,hw}.$

We denote the similarity of the input image $x_a$ to the prototypes as $s_a$ and its component corresponding to the $i$-th prototype $p_i$ as $s_{a,i}$. We refer to $s_a$ as the 'prototype profile' in this paper. Finally, the prototype profile $s_a$ is transformed into $f^P_a$ by the fully connected layer, which is instantiated with a linear layer without a bias term. We use $\hat{f}^P_a$ to calculate the similarity between images.
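To make the inference process above concrete, the following is a minimal PyTorch-style sketch of the prototype-layer computation (cosine-similarity heatmap followed by foreground-masked max pooling). All function and tensor names are our own illustrative choices, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def prototype_profile(z_p, prototypes, fg_mask):
    """Compute the prototype profile s_a from the projected feature map.

    z_p:        (B, C, H, W) feature map Z^P produced by the 1x1 convolution
    prototypes: (P, C) prototype vectors p_i
    fg_mask:    (B, H, W) binary attention mask (1 = foreground)
    Returns:    (B, P) prototype profile s and (B, P, H, W) heatmap H
    """
    # Cosine similarity between every pixel feature and every prototype.
    z_hat = F.normalize(z_p, dim=1)                        # (B, C, H, W)
    p_hat = F.normalize(prototypes, dim=1)                 # (P, C)
    heatmap = torch.einsum('bchw,pc->bphw', z_hat, p_hat)  # (B, P, H, W)

    # Max-pool the similarity over foreground positions only.
    neg_inf = torch.finfo(heatmap.dtype).min
    masked = heatmap.masked_fill(fg_mask.unsqueeze(1) == 0, neg_inf)
    profile = masked.flatten(2).max(dim=2).values          # (B, P)
    return profile, heatmap
```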

2) MULTI-HEAD TRICK IN PROTOTYPE LAYER
As we will explain in Sec. III-B3, we estimate the attribution of prototypes to each sample by comparing the samples contained in a mini-batch. It becomes difficult to properly estimate the attribution when the number of prototypes is large because we cannot obtain enough pairs of samples for comparison to estimate the attribution in such cases. To alleviate this problem, we apply a multi-head trick on the prototype layer. Specifically, as shown in Fig. 2, we split the feature map Z P a into H heads along the channel axis and then assign P/H prototypes and calculate the cluster loss for each head. This enables us to estimate the relationship between prototypes and class labels efficiently even when the number of prototypes is large. Note that we set the number of heads H to 1 (i.e., not utilizing the multi-head trick) unless otherwise specified.
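A minimal, self-contained sketch of the multi-head trick follows, assuming the prototype bank is stored per head with matching channel dimensions (this storage layout is our assumption, not necessarily the released implementation).

```python
import torch
import torch.nn.functional as F

def multi_head_profiles(z_p, head_prototypes, fg_mask):
    """Multi-head prototype scoring.
    z_p:             (B, C, H, W) feature map Z^P
    head_prototypes: (n_heads, P // n_heads, C // n_heads) prototype bank per channel group
    fg_mask:         (B, H, W) binary attention mask
    Returns the concatenated prototype profile of shape (B, P)."""
    n_heads = head_prototypes.shape[0]
    neg_inf = torch.finfo(z_p.dtype).min
    profiles = []
    for z_h, p_h in zip(z_p.chunk(n_heads, dim=1), head_prototypes):
        z_hat = F.normalize(z_h, dim=1)                       # per-head channel group
        p_hat = F.normalize(p_h, dim=1)
        heat = torch.einsum('bchw,pc->bphw', z_hat, p_hat)    # per-head heatmap
        heat = heat.masked_fill(fg_mask.unsqueeze(1) == 0, neg_inf)
        profiles.append(heat.flatten(2).max(dim=2).values)    # (B, P // n_heads)
    return torch.cat(profiles, dim=1)
```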

B. TRAINING PROCESS
Since we do not utilize the pre-defined relationship between prototypes and class labels, we can train ProtoMetric in an end-to-end manner instead of the sub-optimal two-stage training strategy that conventional ProtoPNets adopt. Specifically, ProtoMetric is trained to minimize four loss functions: task loss, auxiliary loss, cluster loss, and suppression loss. We explain the details of each loss function in the following. We also summarize the overall training process in Alg. 2.

1) TASK LOSS L task
Task loss $L_{\text{task}}$ is imposed on $f^P_a$ output by the fully connected layer to help ProtoMetric perform well on a given task. In this paper, as the task loss, we adopt proxy anchor loss [31] and relational knowledge distillation (RKD) loss [11] for the fine-grained image classification and image retrieval tasks, respectively. This is because we need to cope with the overfitting problem in image retrieval tasks while training ProtoMetric with a sufficient number of iterations to bring the prototypes adequately close to the image patch features. In the following, we provide the formulation of the task loss in each task. For clarity, we refer to $f^P_a$ as $f_a$ in this section.

FIGURE 2. Prototype layer with multi-head trick. Feature map $Z^P_a$ is divided into $H$ groups along the channel axis and prototype heatmaps are then calculated for each head. Here, each head has $P/H$ prototypes, where $H$ and $P$ are the number of heads and the total number of prototypes, respectively. After the prototype heatmaps are calculated, they are concatenated and passed to subsequent layers.

a: FINE-GRAINED IMAGE CLASSIFICATION SETTING
As described above, we use proxy anchor loss [31] as the task loss $L_{\text{task}}$ in the fine-grained image classification setting. Thus, $L_{\text{task}}$ in this setting is formulated as

$L_{\text{task}} = \frac{1}{|Q^{+}|} \sum_{q \in Q^{+}} \log\Big(1 + \sum_{a \in B^{+}_{q}} e^{-\alpha\left(\hat{f}_a^{\top}\hat{g}_q - \delta\right)}\Big) + \frac{1}{|Q|} \sum_{q \in Q} \log\Big(1 + \sum_{a \in B^{-}_{q}} e^{\alpha\left(\hat{f}_a^{\top}\hat{g}_q + \delta\right)}\Big),$

where $\delta$, $\alpha$, $g_q$, $Q$, and $Q^{+}$ are the margin, scaling factor, proxy indexed by $q$, overall set of proxy indices, and set of proxy indices that belong to the class labels contained in a mini-batch, respectively. $B^{+}_{q}$ is the set of data indices with the same class label as proxy $g_q$, and $B^{-}_{q}$ is vice versa.
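For illustration, a PyTorch sketch of the proxy anchor loss with the symbols above (margin δ, scale α, one proxy per class) is given below; it is a re-implementation following [31], not the authors' code, and the default hyperparameter values are assumptions.

```python
import torch
import torch.nn.functional as F

def proxy_anchor_loss(f, labels, proxies, delta=0.1, alpha=32.0):
    """Proxy anchor loss (illustrative re-implementation following [31]).
    f: (B, D) embeddings, labels: (B,) class indices, proxies: (Q, D), one proxy per class."""
    sim = F.normalize(f, dim=1) @ F.normalize(proxies, dim=1).t()        # (B, Q) cosine similarities
    pos_mask = F.one_hot(labels, num_classes=proxies.shape[0]).bool()    # membership in B^+_q
    neg_mask = ~pos_mask

    pos_exp = torch.where(pos_mask, torch.exp(-alpha * (sim - delta)), torch.zeros_like(sim))
    neg_exp = torch.where(neg_mask, torch.exp(alpha * (sim + delta)), torch.zeros_like(sim))

    with_pos = pos_mask.any(dim=0)                                       # proxies q in Q^+
    pos_term = torch.log1p(pos_exp.sum(dim=0)[with_pos]).sum() / with_pos.sum().clamp(min=1)
    neg_term = torch.log1p(neg_exp.sum(dim=0)).sum() / proxies.shape[0]
    return pos_term + neg_term
```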

b: IMAGE RETRIEVAL SETTING
In this setting, we train ProtoMetric by distilling the black-box model with RKD loss [11], which consists of two loss functions: distance-wise distillation loss $L_D$ and angle-wise distillation loss $L_A$. Since the original version of RKD loss is designed for Euclidean space, we modified it to account for the geometry of hyper-spherical space. The details of the modification are discussed below.

Distance-wise distillation loss $L_D$ is imposed so that the image distances inferred by the student model are the same as the image distances inferred by the teacher model. Because the distance between two points on a unit hypersphere is defined as the shorter length of the arc of a great circle connecting the two points, we can define the distance-wise distillation loss on a unit hypersphere as

$L_D = \frac{1}{|B|^2} \sum_{a,b \in B} l_{\delta}\!\left(\arccos\big(\hat{f}^{P\top}_a \hat{f}^{P}_b\big),\ \arccos\big(\hat{t}^{\top}_a \hat{t}_b\big)\right),$

where the feature vector output by the teacher model is denoted as $t_a$ and $l_{\delta}$ is the smooth L1 loss, which is formulated as

$l_{\delta}(x, y) = \begin{cases} \frac{1}{2}(x-y)^2 & \text{if } |x-y| \le 1, \\ |x-y| - \frac{1}{2} & \text{otherwise.} \end{cases}$

Angle-wise distillation loss $L_A$ is imposed so that the angles formed by triplets of feature vectors output by the teacher and the student model are the same. As shown in Fig. 3, the angle $\angle BAC$ formed by three points $A$, $B$, $C$ on the hypersphere is the same as $\angle B'AC'$ formed by the three corresponding points on the plane tangent to the hypersphere at point $A$. Thus, the angle-wise distillation loss is re-formulated as

$L_A = \frac{1}{|B|^3} \sum_{a,b,c \in B} l_{\delta}\!\left(\cos \angle\big(\hat{f}^{P}_b, \hat{f}^{P}_a, \hat{f}^{P}_c\big),\ \cos \angle\big(\hat{t}_b, \hat{t}_a, \hat{t}_c\big)\right),$

where $\angle(B, A, C)$ denotes the angle at vertex $A$ measured on the tangent plane at $A$. The task loss in the image retrieval setting is then formulated as

$L_{\text{task}} = \lambda_D L_D + \lambda_A L_A,$

where $\lambda_D$ and $\lambda_A$ are the hyperparameters that adjust the weight of each loss function. We set them to 1.0 and 2.0, respectively, following [11].

Algorithm 2. Training process of ProtoMetric.
  for each epoch do
    for mini-batch $B = \{x_a, y_a\}_{a=0,1,\ldots} \in \mathcal{T}$ do
      // Encode the input images
      Take input image $x_a$ into model $F$ and obtain feature vectors $f^B_a$ and $f^P_a$, foreground mask $\check{A}_a$, prototype profile $s_a$, and prototype heatmap $H_a$ following Algorithm 1.
      // Cluster loss
      Sample the prototypes following Eq. 10 and obtain the sampling results $E(x_a, p_i)$.
      Modify $E(x_a, p_i)$ to $E'(y_a, p_i)$ with the Sinkhorn algorithm so that Eq. 14 is satisfied.
      Update the external memory following Eq. 15.
      Solve the linear assignment problem defined in Eq. 16 and obtain the solution $T^{y*}_{a,i}$.
      Calculate the cluster loss $L_{\text{clst}}$ following Eq. 17.
      // Other losses
      Calculate the task loss $L_{\text{task}}$ with $f^P_a$ and $y_a$.
      Calculate the auxiliary loss $L_{\text{aux}}$ with $f^B_a$ and $y_a$.
      Calculate the suppression loss $L_{\text{supp}}$ with $H_a$ and $\check{A}_a$ following Eq. 18.
      // Model update
      Update the model parameters $\theta$ and the prototypes on the basis of the gradient with respect to them.
    end for
  end for
  Conduct prototype projection following Eq. 20.
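Returning to the distillation losses above, the following is a minimal sketch of the distance-wise term on the unit hypersphere, assuming the arc length (arccos of the cosine similarity) as the pairwise distance and the standard smooth L1 loss; the normalization details are illustrative.

```python
import torch
import torch.nn.functional as F

def spherical_distance_distillation(student, teacher):
    """Distance-wise RKD-style term on the unit hypersphere.
    student, teacher: (B, D) embeddings; both are L2-normalized below."""
    s = F.normalize(student, dim=1)
    t = F.normalize(teacher, dim=1)
    # Arc length between every pair of points on the unit hypersphere.
    d_s = torch.arccos((s @ s.t()).clamp(-1 + 1e-7, 1 - 1e-7))
    d_t = torch.arccos((t @ t.t()).clamp(-1 + 1e-7, 1 - 1e-7))
    # Match the student's pairwise distances to the teacher's (smooth L1 / Huber).
    return F.smooth_l1_loss(d_s, d_t)
```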

2) AUXILIARY LOSS
Auxiliary loss is imposed on $f^B_a$ output by the attention layer to help the convolutional layer acquire good feature extraction ability and good attention masks. We adopted margin loss [10] as the auxiliary loss in this paper. Thus, the auxiliary loss $L_{\text{aux}}$ is formulated as

$L_{\text{aux}} = \frac{1}{N} \sum_{(a,p,n)} \Big( \big[d_{ap} - \beta + m\big]_{+} + \big[\beta - d_{an} + m\big]_{+} \Big),$

where $m$ and $\beta$ are the margin and the learnable parameter, and $a$, $p$, and $n$ are the data indices of the anchor, positive, and negative samples, respectively. $[\cdot]_{+}$ indicates the ReLU function and $N$ is the number of terms in the summation that take non-zero values. $d_{ap}$ and $d_{an}$ are the Euclidean distances defined as $\|\hat{f}^B_a - \hat{f}^B_p\|_2$ and $\|\hat{f}^B_a - \hat{f}^B_n\|_2$, respectively. In the image retrieval task, we additionally adopted regularization loss [36] to alleviate the overfitting.
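As a concrete illustration, a sketch of the margin loss over pre-mined triplets is shown below; the distance-weighted sampling of [10] is omitted, and the triplet indices are assumed to be given.

```python
import torch
import torch.nn.functional as F

def margin_loss(f, a_idx, p_idx, n_idx, beta, m=0.2):
    """Margin loss over pre-mined (anchor, positive, negative) triplets.
    f: (B, D) embeddings f^B; beta: learnable scalar; m: margin."""
    f = F.normalize(f, dim=1)
    d_ap = (f[a_idx] - f[p_idx]).norm(dim=1)
    d_an = (f[a_idx] - f[n_idx]).norm(dim=1)
    pos = F.relu(d_ap - beta + m)   # penalize positives farther than beta - m
    neg = F.relu(beta - d_an + m)   # penalize negatives closer than beta + m
    # N = number of non-zero terms in the summation.
    n_nonzero = ((pos > 0).sum() + (neg > 0).sum()).clamp(min=1)
    return (pos.sum() + neg.sum()) / n_nonzero
```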

3) CLUSTER LOSS
Cluster loss is imposed to link prototypes to image patch features so that each prototype represents a certain image patch in the training set. As described above and detailed in our conference paper [9], the pre-defined relationship between prototypes and class labels is no longer available in similarity learning, which makes it difficult to calculate the cluster loss in the same way as the conventional ProtoPNets. To alleviate this problem, we came up with a novel cluster loss that first estimates the attribution of prototypes to each class and then calculates the cluster loss on the basis of that estimation [9]. We call the attribution of prototypes to each class 'prototype affiliation' in this paper. In the following, we describe the details of the proposed cluster loss and its improvements from our conference paper.

a: ESTIMATION OF PROTOTYPE AFFILIATION
Figure 4 shows the overall estimation process of prototype affiliation in our cluster loss. Following our conference paper [9], we extract the sample-specific prototypes by comparing the samples contained in the mini-batch. How much prototype $p_i$ is contained in the sample $x_a$ is represented by the $i$-th component of the prototype profile, $s_{a,i}$. Thus, $x_a$ is expected to contain $p_i$ when $s_{a,i}$ takes a larger value compared to $s_{b,i}$. Therefore, by extracting prototypes with a large difference between $s_a$ and $s_b$ for all of the data indices in the mini-batch, we can obtain the set of prototypes contained in $x_a$. This extraction process is achieved by stochastic sampling with the Gumbel top-k trick [37], which is formulated as

$E(x_a, p_i) = \frac{1}{|B| - 1} \sum_{b \in B,\, b \neq a} \mathrm{topk}_i\!\left(\frac{s_a - s_b}{\tau}\right), \qquad (10)$

where $\tau$ is the temperature parameter (fixed to 0.05 in this paper) and $\mathrm{topk}$ is the Gumbel top-k operation formulated as

$\mathrm{topk}_i(v) = \mathbb{1}\Big(\sum_{j} \mathbb{1}\big(v_j + \gamma_j > v_i + \gamma_i\big) < k\Big), \qquad (11)$

where $\gamma_*$ ($* \in \{i, j\}$) are random variables that follow the standard Gumbel distribution, and $\mathbb{1}$ is the indicator function that returns 1 if the condition in parentheses is true and 0 otherwise. We set $k$ in the Gumbel top-k operation to 3 in this paper. After the sampling, we average the sampling results $E(x_a, p_i)$ among the samples that have the same class label, as

$E(y, p_i) = \frac{\sum_{a \in B} \mathbb{1}(y_a = y)\, E(x_a, p_i)}{\sum_{a \in B} \mathbb{1}(y_a = y)}. \qquad (12)$
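One plausible implementation of this sampling step is sketched below, assuming that for every pair $(x_a, x_b)$ standard Gumbel noise is added to the scaled profile difference and the $k$ largest prototypes are selected; gradient handling through the hard selection (e.g., a straight-through estimator) is omitted.

```python
import torch

def gumbel_topk_extraction(profiles, tau=0.05, k=3):
    """profiles: (B, P) prototype profiles s_a.
    Returns E: (B, P), the fraction of pairwise comparisons in which each
    prototype was selected for each sample."""
    B, P = profiles.shape
    diff = (profiles.unsqueeze(1) - profiles.unsqueeze(0)) / tau     # (B, B, P): (s_a - s_b)/tau
    gumbel = -torch.log(-torch.log(torch.rand_like(diff)))           # standard Gumbel noise
    topk_idx = (diff + gumbel).topk(k, dim=-1).indices               # k prototypes per (a, b) pair
    sel = torch.zeros_like(diff).scatter_(-1, topk_idx, 1.0)         # one-hot selection
    off_diag = 1.0 - torch.eye(B, device=profiles.device)
    return (sel * off_diag.unsqueeze(-1)).sum(dim=1) / (B - 1)       # average over b != a
```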
The averaged result E(y, p i ) is still biased towards a specific prototype independent of the class labels, which is undesirable, so we need to remove the bias. To do so, as detailed in our conference paper, we first normalize the matrix E so that the sums in the column direction are equal and then normalize the sums in the row direction to be equal, where E is the matrix whose component in column y and row i is E(y, p i ), and E y,i ≡ E[y, i] ≡ E(y, p i ). Here, the normalization in the column direction is conducted so that the average of prototype affiliation with respect to class labels has the same value across prototypes, and the normalization in the row direction is conducted so that each class contains the same amount of prototype affiliation. However, because this 'de-bias' procedure does not consider whether a class is included in the mini-batch, it may introduce another bias into the training.
To alleviate this problem, we propose a new de-biasing method that considers whether a class is included in the mini-batch and normalizes the row and the column direction at the same time. Simultaneous normalization along the row and the column direction is formulated as a matrix scaling problem that transforms the given matrix $E$ into a matrix $E'$ whose sums in the row and column directions are $u$ and $v$, respectively:

$\text{find } E' = \mathrm{diag}(r)\, E\, \mathrm{diag}(c) \quad \text{s.t.} \quad E'\mathbf{1} = u, \ \ E'^{\top}\mathbf{1} = v, \qquad (13)$

which is a problem that can be solved with Sinkhorn iteration [38], [39]. In this paper, we set $v_j$ to 1 for all $j$ to satisfy the condition that each class contains the same amount of prototype affiliation (i.e., 1). Thus, the only remaining problem to achieve the desired de-bias is to determine $u$ considering whether each class label is included in the mini-batch or not. For this purpose, we utilize an external memory $M$ that contains the prototype affiliation for each class. Specifically, we determine $u$ such that the sum of the prototype affiliation with respect to the class labels contained in the mini-batch is equal to the sum of the prototype affiliation contained in the external memory with respect to the corresponding class labels, as

$u_i = \sum_{y \in \mathcal{Y}} M_{y,i}, \quad \forall i \in \mathcal{P}, \qquad (14)$

where $C$, $\mathcal{Y}$, $\mathcal{P}$, and $|\mathcal{P}|$ refer to the number of class labels contained in the training set, the set of class labels contained in the mini-batch, the set of overall prototype indices, and the cardinality of $\mathcal{P}$, respectively. $M$ is initialized with the same value $C/|\mathcal{P}|$ and then updated by

$M_{y,i} \leftarrow m\, M_{y,i} + (1 - m)\, E'_{y,i}, \quad \forall y \in \mathcal{Y}, \qquad (15)$

where $m$ is the momentum (fixed to 0.9 in this paper).
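A minimal sketch of the matrix scaling and memory update is given below, assuming rows index prototypes and columns index the classes present in the mini-batch, with the targets $u$ and $v$ chosen as described above; iteration counts and numerical safeguards are illustrative.

```python
import torch

def sinkhorn_scale(E, u, v, n_iters=50, eps=1e-8):
    """Scale a non-negative matrix E (rows: prototypes, cols: classes in the batch)
    so that its row sums approach u and its column sums approach v (cf. Eq. 13)."""
    E = E.clamp(min=eps)
    for _ in range(n_iters):
        E = E * (u.unsqueeze(1) / E.sum(dim=1, keepdim=True).clamp(min=eps))  # match row sums
        E = E * (v.unsqueeze(0) / E.sum(dim=0, keepdim=True).clamp(min=eps))  # match column sums
    return E

def debias_and_update(E, memory, batch_classes, momentum=0.9):
    """De-bias E with the external memory and update the memory (cf. Eqs. 14-15).
    memory: (num_classes, P); batch_classes: class indices present in the mini-batch."""
    u = memory[batch_classes].sum(dim=0)                 # per-prototype targets from the memory
    v = torch.ones(len(batch_classes), device=E.device)  # each class holds one unit of affiliation
    E_prime = sinkhorn_scale(E, u, v)                    # (P, |Y|)
    memory[batch_classes] = momentum * memory[batch_classes] + (1 - momentum) * E_prime.t()
    return E_prime, memory
```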

b: CALCULATION OF CLUSTER LOSS
Once we estimate the prototype affiliation, we can calculate the cluster loss on the basis of that estimation. Following our conference paper [9], we modified the prototype affiliation for each sample to account for the absence of prototypes with high affiliation to the class due to differences in the object views. Specifically, we formulate the modification of prototype affiliation as a linear assignment problem:

$T^{y*} = \underset{T \geq 0}{\arg\min} \sum_{a:\, y_a = y} \sum_{i} T_{a,i}\, C_{a,i} \quad \text{s.t.} \quad \frac{1}{N_y} \sum_{a:\, y_a = y} T_{a,i} = E'(y, p_i), \qquad (16)$

where $N_y$ is the number of samples whose class label is $y$, and $C_{a,i}$ is the distance between prototype $p_i$ and image $x_a$, defined as $C_{a,i} = \min_{z \in Z^P_a} \|z - p_i\|_2^2$. We solve the linear assignment problem of Eq. 16 with the Sinkhorn-Knopp algorithm [40]. The solution of Eq. 16, $T^{y*}_{a,i}$, is small when sample $x_a$ does not contain prototype $p_i$, i.e., when the distance between them ($C_{a,i}$) is large, and large otherwise. Because of the condition in Eq. 16, the average of $T^{y*}_{a,i}$ among the samples with the same class label is equivalent to the prototype attribution to the class. Therefore, $T^{y*}_{a,i}$ can be considered as the modified prototype attribution for each sample. We thus use $T^{y*}_{a,i}$ to calculate the cluster loss, as

$L_{\text{clst}} = -\frac{1}{|B|} \sum_{a \in B} \sum_{i} T^{y_a *}_{a,i} \Big( \max_{(h,w):\, \check{A}_{a,hw}=1} H_{a,i,hw} - \frac{1}{HW} \sum_{h,w} H_{a,i,hw} \Big). \qquad (17)$

Here, we utilize the trick [8] that maximizes the gap between the maximum and the average prototype activation in the image so that prototypes do not represent the background image patches.
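The per-class assignment and the resulting cluster loss can be sketched as follows; the entropic regularization strength, the marginals, and the exact form of the activation-gap loss are our assumptions rather than the precise quantities used in Eqs. 16-17.

```python
import torch

def sinkhorn_knopp(cost, a, b, epsilon=0.05, n_iters=50):
    """Entropy-regularized transport plan T >= 0 that approximately minimizes
    <T, cost> with row sums a (samples of one class) and column sums b (prototypes)."""
    K = torch.exp(-cost / epsilon)                        # (N_y, P)
    u = torch.ones_like(a)
    for _ in range(n_iters):
        u = a / (K @ (b / (K.t() @ u).clamp(min=1e-8))).clamp(min=1e-8)
    v = b / (K.t() @ u).clamp(min=1e-8)
    return u.unsqueeze(1) * K * v.unsqueeze(0)

def cluster_loss(heatmap, fg_mask, T):
    """Cluster loss weighted by the per-sample affiliation T (N, P), using the
    max-minus-mean activation-gap trick of ProtoPool [8]."""
    neg_inf = torch.finfo(heatmap.dtype).min
    fg_max = heatmap.masked_fill(fg_mask.unsqueeze(1) == 0, neg_inf).flatten(2).max(dim=2).values
    mean_act = heatmap.flatten(2).mean(dim=2)             # (N, P)
    return -(T * (fg_max - mean_act)).sum() / heatmap.shape[0]
```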

4) SUPPRESSION LOSS L supp
We utilize the attention mask $\check{A}_{a,hw}$ output by the attention layer to prevent prototypes from representing the background image patches. Specifically, we introduce the suppression loss, which is formulated as

$L_{\text{supp}} = \frac{1}{|B|} \sum_{a \in B} \frac{\sum_{i}\sum_{h,w} \big(1 - \check{A}_{a,hw}\big)\, H_{a,i,hw}}{\sum_{i}\sum_{h,w} \big(1 - \check{A}_{a,hw}\big)}. \qquad (18)$

The suppression loss penalizes the prototype activations at image patches belonging to the background (where $\check{A}_{a,hw} = 0$), and thus prevents prototypes from representing background image patches.

5) OVERALL LOSS AND PROTOTYPE PROJECTION
The overall loss function we minimize to train ProtoMetric is formulated as

$L = L_{\text{task}} + \lambda_{\text{aux}} L_{\text{aux}} + \lambda_{\text{clst}} L_{\text{clst}} + \lambda_{\text{supp}} L_{\text{supp}}, \qquad (19)$

where $\lambda_{\text{aux}}$, $\lambda_{\text{clst}}$, and $\lambda_{\text{supp}}$ are the hyperparameters that adjust the weight of each loss function. After the training, we project prototypes onto the nearest image patches in the training set following

$p_i \leftarrow \underset{z^P_{a,hw},\ a \in \mathcal{T}}{\arg\max}\ \hat{z}^{P\top}_{a,hw}\, \hat{p}_i, \qquad (20)$

where $\mathcal{T}$ is the set of data indices contained in the training set. This guarantees that each prototype represents the corresponding image patch, and thus the interpretability with case-based reasoning is achieved.
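For completeness, a sketch of the projection step is shown below: each prototype is replaced by its most cosine-similar training image patch feature. Whether the search should be restricted to foreground patches is an implementation choice not shown here, and the encoder interface is assumed.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def project_prototypes(encode, loader, prototypes):
    """Replace each prototype with the most cosine-similar image patch feature in the
    training set (cf. Eq. 20). `encode(x)` is assumed to return Z^P of shape (B, C, H, W)."""
    p_hat = F.normalize(prototypes, dim=1)                                    # (P, C)
    best_sim = torch.full((prototypes.shape[0],), -float('inf'), device=prototypes.device)
    best_patch = prototypes.clone()
    for x, _ in loader:
        z = encode(x)                                                          # (B, C, H, W)
        patches = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])                # (B*H*W, C)
        sim = F.normalize(patches, dim=1) @ p_hat.t()                          # (B*H*W, P)
        top_sim, top_idx = sim.max(dim=0)                                      # best patch per prototype
        improved = top_sim > best_sim
        best_sim[improved] = top_sim[improved]
        best_patch[improved] = patches[top_idx[improved]]
    return best_patch
```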

C. INFERENCE AND INTERPRETATION
In this section, we explain how to interpret the inference of ProtoMetric in both fine-grained image classification and image retrieval settings.

a: FINE-GRAINED CLASSIFICATION SETTING
In the fine-grained image classification setting, class centers $c_y$ are first calculated by averaging the feature vectors obtained from the training data for each class. ProtoMetric then outputs the class label corresponding to the class center that has the maximum cosine similarity to the input sample as the prediction. Thus, the prediction label $y_{\text{pred}}$ of the input sample $x_a$ is given by

$y_{\text{pred}} = \underset{y}{\arg\max}\ \hat{f}^{P\top}_a \hat{c}_y. \qquad (21)$

Because the fully connected layer is a linear layer $W$ without a bias term (i.e., $f^P_a = W s_a$), this class similarity can be decomposed over prototypes as

$\hat{f}^{P\top}_a \hat{c}_y = \sum_{j} s_{a,j}\, C^{a,y}_j, \quad \text{where} \quad C^{a,y}_j = \frac{\big(W^{\top} c_y\big)_j}{\|W s_a\|_2\, \|c_y\|_2}. \qquad (22)$

Therefore, prototypes corresponding to the largest components $s_{a,j} C^{a,y}_j$ in Eq. 22 are the reason image $x_a$ is classified into class $y$.

b: IMAGE RETRIEVAL SETTING
As described in Sec. II-C, ProtoMetric infers the image similarity with the cosine similarity. In the same way, the similarity between $x_a$ and $x_b$ can be written as

$\hat{f}^{P\top}_a \hat{f}^P_b = \sum_{j,k} s_{a,j}\, R^{a,b}_{j,k}\, s_{b,k}, \quad \text{where} \quad R^{a,b}_{j,k} = \frac{\big(W^{\top} W\big)_{j,k}}{\|W s_a\|_2\, \|W s_b\|_2}. \qquad (23)$

Therefore, pairs of prototypes corresponding to the largest components $s_{a,j} R^{a,b}_{j,k} s_{b,k}$ in Eq. 23 are the reason for $x_a$ being similar to $x_b$.
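The decomposition in Eqs. 22-23 can be computed directly from the weight matrix $W$ of the bias-free fully connected layer; a sketch with hypothetical tensor names is shown below.

```python
import torch

def similarity_decomposition(s_a, s_b, W, top_n=5):
    """Decompose cos(f^P_a, f^P_b) into prototype-pair contributions s_a[j] * R[j,k] * s_b[k]
    (cf. Eq. 23), where f^P = W s and W has no bias. Returns the top contributing pairs."""
    f_a, f_b = W @ s_a, W @ s_b
    R = (W.t() @ W) / (f_a.norm() * f_b.norm())          # (P, P)
    contrib = s_a.unsqueeze(1) * R * s_b.unsqueeze(0)    # contrib[j, k] = s_a,j * R_jk * s_b,k
    # The contributions sum exactly to the cosine similarity between f^P_a and f^P_b.
    top = contrib.flatten().topk(top_n)
    P = contrib.shape[1]
    pairs = [(int(i) // P, int(i) % P) for i in top.indices]
    return list(zip(pairs, top.values.tolist()))
```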

IV. EXPERIMENTS
We mainly evaluate ProtoMetric on CUB200-2011 [41], [42] and Stanford Cars [43], which are commonly utilized datasets to evaluate ProtoPNets and deep metric learning methods. Additionally, we used Stanford Dogs [44] for the comparison with Deformable ProtoPNet [7]. In the following, we describe the implementation details in Sec. IV-A and the comparison results with other ProtoPNets in accuracy on fine-grained image classification tasks in Sec. IV-B. Next, we discuss the ablation study results and qualitative evaluation results in the fine-grained image classification task in Secs. IV-C and IV-D, respectively. Finally, we present the results of case studies in which ProtoMetric is utilized for image retrieval tasks in Sec. IV-E.

TABLE 1. Comparison results with other ProtoPNets on CUB200-2011. We report the number of prototypes (No. of proto.) and top-1 accuracy (Acc.) here. 'iN' means that the model backbones were pretrained on i-Naturalist 2017. The first and second most accurate methods are indicated with bold and underline, respectively. We also change the font of our method to blue for clarity. Please refer to the main text for more details.

A. IMPLEMENTATION DETAILS
Our implementation is based on the work of Roth et al. [45]. We cropped the images from the CUB200-2011 and Stanford Cars datasets with bounding boxes surrounding the objects, following previous studies [1], [5], [8]. However, we used the full images when conducting experiments on the Stanford Dogs dataset, following [7]. We utilized the models pre-trained on i-Naturalist 2017 when conducting experiments on CUB200-2011 and utilizing ResNet 50 for the convolutional layer of ProtoMetric, following the previous studies [4], [7], [8]. We also used the models pre-trained on ImageNet for the other experimental settings. The size of the input images was transformed to 224 × 224, so the resolution of the feature map output from the model backbone was 7 × 7. Following previous studies [4], [8], we reduced the number of channels in the feature map Z^P to 128 for the experiments on Stanford Cars and to 256 for CUB200-2011 and Stanford Dogs by applying a 1 × 1 convolutional layer to Z^B. We also set the number of channels of the feature vector f^B to 128 for the experiments on Stanford Cars and to 256 for CUB200-2011 and Stanford Dogs. The number of channels of feature vector f^P output by the fully connected layer was set to 512. We set the number of prototypes to 195 for the experiments on Stanford Cars and to 202 for CUB200-2011 as a default, following the previous studies [4], [8]. We also set the number of prototypes and the number of heads to 368 and 4 for the experiments on image retrieval tasks and to 512 and 8 for Stanford Dogs, respectively. We set the weight of each loss function λ_aux, λ_clst, and λ_supp to 1.0, 4.0, and 0.1 in the fine-grained image classification settings and to 1.0, 0.8, and 0.05 in the image retrieval settings, respectively. The threshold of the attention mask t_ram was set to 0.9 and 0.99 in the fine-grained image classification and image retrieval settings, respectively. Hyperparameters for proxy anchor loss [31], margin loss [10], and multi-level distance regularization (MDR) loss [36] were set in accordance with the original papers. We adopted the Adam optimizer and set the learning rate to 1e-5 for the convolutional layers and to 5e-4 for the other layers. The number of epochs was set to 150, 200, and 150 for the experiments on CUB200-2011, Stanford Cars, and Stanford Dogs, respectively, but was set to 80 when training the ResNet 50 pre-trained on i-Naturalist 2017 for CUB200-2011. This is because that model has already acquired good feature extraction ability for the domain and we do not need to train ProtoMetric for as long. We also set the number of epochs to 150 for the image retrieval settings. In all of the settings, we trained ProtoMetric without L_clst or L_supp for the first ten epochs. For data augmentation, we utilized RandomPerspective, ColorJitter, RandomHorizontalFlip, RandomAffine, and RandomCrop, following the previous study [4]. We also constructed the mini-batch so that it contained 56 classes with two images per class. In this paper, we report the average experimental results over three different seeds.

TABLE 2. Comparison results with other ProtoPNets on Stanford Cars. We report the number of prototypes (No. of proto.) and top-1 accuracy (Acc.) here. Methods with the first and second best accuracy are indicated with bold and underline, respectively. We also change the font of our method to blue for clarity. Please refer to the main text for more details.
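For reference, the settings listed in this subsection can be summarized as a single configuration; this is a convenience summary compiled from the text above, not a released configuration file.

```python
# Summary of the experimental settings stated above (fine-grained classification unless noted).
CONFIG = {
    "image_size": 224,
    "feature_map_resolution": 7,
    "channels": {
        "Z_P": {"StanfordCars": 128, "CUB200-2011": 256, "StanfordDogs": 256},
        "f_B": {"StanfordCars": 128, "CUB200-2011": 256, "StanfordDogs": 256},
        "f_P": 512,
    },
    "num_prototypes": {"StanfordCars": 195, "CUB200-2011": 202, "StanfordDogs": 512, "retrieval": 368},
    "num_heads": {"default": 1, "StanfordDogs": 8, "retrieval": 4},
    "loss_weights": {
        "classification": {"lambda_aux": 1.0, "lambda_clst": 4.0, "lambda_supp": 0.1},
        "retrieval": {"lambda_aux": 1.0, "lambda_clst": 0.8, "lambda_supp": 0.05},
    },
    "t_ram": {"classification": 0.9, "retrieval": 0.99},
    "optimizer": "Adam",
    "learning_rate": {"backbone": 1e-5, "other_layers": 5e-4},
    "epochs": {"CUB200-2011": 150, "CUB200-2011_iNat_backbone": 80, "StanfordCars": 200,
               "StanfordDogs": 150, "retrieval": 150},
    "warmup_epochs_without_clst_and_supp": 10,
    "batch": {"classes_per_batch": 56, "images_per_class": 2},
    "num_seeds": 3,
}
```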

B. COMPARISON WITH OTHER PROTOPNETS
Tables 1, 2, and 3 show the comparison results with other ProtoPNets on CUB200-2011, Stanford Cars, and Stanford Dogs, respectively. In Tables 1 and 2, we set the number of heads to four when the number of prototypes was 512. As shown, ProtoMetric achieved a higher accuracy than ProtoPool in all of the experimental settings with the same number of prototypes. This is because ProtoMetric can be trained without the sub-optimal two-stage training thanks to our proposed cluster loss. We can also see that ProtoMetric achieved a comparable result to the state-of-the-art TesNet and Deformable ProtoPNet with a smaller number of prototypes (Tables 1, 2, and 3). The slight accuracy degradation compared to TesNet might be because TesNet constructs good feature representations by utilizing the pre-defined relationship between prototypes and class labels. Further improvement of ProtoMetric in accuracy by utilizing such methods will be the focus of our future work.

TABLE 3. Comparison results with other ProtoPNets on Stanford Dogs. We report the number of prototypes (No. of proto.) and top-1 accuracy (Acc.) here. Methods with the first and second best accuracy are indicated with bold and underline, respectively. We also change the font of our method to blue for clarity. Please refer to the main text for more details.

C. ABLATION STUDY
In this section, we first explain the results of our ablation study on the task loss L_task and auxiliary loss L_aux, and then present the results of the ablation study on the cluster loss L_clst.

a: ABLATION STUDY FOR L task AND L aux
Table 4 shows the results of the ablation study on the task loss and the auxiliary loss. As we can see, the accuracy does not change regardless of whether the proxy anchor loss or the margin loss is used for the auxiliary loss, but it decreases significantly when no auxiliary loss is imposed. This is because the linear layer in the attention layer is not trained and thus ProtoMetric cannot obtain a good attention mask when we impose no auxiliary loss during the training. Therefore, the auxiliary loss is confirmed to be necessary for ProtoMetric to obtain a good attention mask and achieve high accuracy. Table 4 also shows that ProtoMetric achieves a higher accuracy when proxy anchor loss is utilized for the task loss than when margin loss is utilized. As this result suggests that the choice of task loss affects the final accuracy of ProtoMetric, we set the task loss to proxy anchor loss by default. We also set the auxiliary loss to margin loss as a default because the regularization loss [36] introduced in the image retrieval tasks considers the ranking-based loss function.

b: ABLATION STUDY FOR L clst
We conducted the ablation study to validate each component of our cluster loss. As shown in Table 5, the cosine similarities between prototypes and their nearest image patch features are small and the accuracy degrades after the prototype projection (Eq. 20) when we do not utilize the cluster loss ('w/o Cluster loss' in Table 5). The degradation in accuracy after the prototype projection suggests that prototypes significantly change before and after the projection. Thus, the 'this looks like that' framework is not feasible in this case because prototypes cannot be said to represent a certain image patch. We can also confirm that the results when we link prototypes to the image patch with the highest similarity in each sample ('Take the most similar') are almost the same as the result of 'w/o Cluster loss'. Thus, the 'this looks like that' framework is not feasible in the 'take the most similar' setting, either.

TABLE 5. Results of ablation study for cluster loss. We conducted the experiments on CUB200-2011, and ResNet 50 pretrained on i-Naturalist was adopted for the convolutional layer. Here, we report the average cosine similarity (denoted as 'Avg. cossim.') between the prototypes and the nearest image patch features. We also report the top-1 accuracy [%] before and after the prototype projection (denoted as 'Acc. before' and 'Acc. after', respectively).
In contrast, the cosine similarities between prototypes and their nearest image patch features increase and the difference in accuracy before and after the prototype projection decreases when we use the class-averaged sampling result E(y, p_i) in Eq. 12 as the prototype affiliation ('Sampling (Eq. 10)'). This confirms the effectiveness of estimating the prototype affiliation on the basis of the difference in prototype profile among samples contained in a mini-batch. In addition, when we remove the bias caused by which class labels are sampled in the mini-batch by utilizing the external memory ('+ Debias with memory (Eq. 13)'), the cosine similarities increase significantly and the accuracy is almost the same before and after the prototype projection. This is because we can properly link all of the prototypes to image patches on the basis of prototype affiliation by correcting for the sampling bias with the external memory. Thus, by comparing the samples contained in the mini-batch and removing the sampling bias with the external memory, the 'this looks like that' framework is feasible without having to pre-define the relationship between prototypes and class labels.
Furthermore, the cosine similarities and the accuracy after the prototype projection increase slightly when we modify the prototype affiliation for each sample by considering the difference in view of samples ('+ Sample-wise modification (Eq. 17)'). This is because we can properly link prototypes and image patches by considering the difference in view of samples. These results demonstrate the effectiveness of each module of our cluster loss.

D. QUALITATIVE EVALUATION
Figures 5 and 6 show examples of the interpretation of class prediction with ProtoMetric on CUB200-2011 and Stanford Cars, respectively. As discussed in Sec. III-C, the prototypes corresponding to the largest $s_{a,j} C^{a,y}_j$ in Eq. 22 are the reason image $x_a$ is classified into class $y$. Therefore, we sort the $s_{a,j} C^{a,y}_j$ in descending order and enumerate them as shown in the figures. Figure 5 shows that the largest rationale for classifying the input image as 'green-tailed towhee' is the reddish brown crown, which increases the class score by 0.300. We can interpret the reasoning of the class prediction quantitatively by repeating the same process for the remaining terms. These results demonstrate that we can interpret the class prediction of ProtoMetric in the same way as the other ProtoPNets.

TABLE 6. Teacher is the black-box model we utilized to perform the distillation. ResNet 50 pretrained on ImageNet was adopted for the convolutional layer on both CUB200-2011 and Stanford Cars. Here, we adopted Rank-1, 2, 4 accuracy and the normalized mutual information for evaluation metrics. Please refer to the main text for more details.

E. APPLICATION TO IMAGE RETRIEVAL TASK
In this section, we describe the results of applying ProtoMetric to image retrieval tasks. As discussed in Sec. III-B, in image retrieval tasks, we train ProtoMetric by distilling a black-box teacher model. Table 6 lists the accuracy of the black-box teacher model and ProtoMetric at the end of training. As we can see, ProtoMetric achieves almost the same accuracy as the black-box teacher model after the training, which confirms that it can be trained without overfitting and achieve a high accuracy in image retrieval tasks by distilling a black-box teacher model. Please note that our purpose in this case study is to confirm whether ProtoMetric can be applied to an image retrieval task, and a comparison with other deep metric learning methods is beyond the scope of this paper. Figures 7 and 8 show examples of the interpretation of the image similarity inferred by ProtoMetric on CUB200-2011 and Stanford Cars, respectively. As discussed in Sec. III-C, the pair of prototypes corresponding to the largest $s_{a,j} R^{a,b}_{j,k} s_{b,k}$ in Eq. 23 is the reason the query image $x_a$ is similar to the gallery image $x_b$. Therefore, we sort the $s_{a,j} R^{a,b}_{j,k} s_{b,k}$ in descending order and enumerate them as shown in the figures. As we can see in Fig. 7, the largest rationale for $x_a$ being similar to $x_b$ is the pattern of the feather, which increases the image similarity by 0.078. We can interpret the reasoning of the image similarity quantitatively by repeating the same process for the remaining terms. These results confirm that we can construct the 'this looks like that' framework in image retrieval tasks with ProtoMetric.

V. CONCLUSION
In this paper, we proposed ProtoMetric, an extension of ProtoPNet that can be applied to similarity learning. Our experimental findings show that ProtoMetric achieved competitive results with state-of-the-art ProtoPNets with a smaller number of prototypes on multiple open datasets for fine-grained image classification. Notably, ProtoMetric achieved a higher accuracy than ProtoPool with the same number of prototypes. Further, through case studies, we showed that ProtoMetric can be applied to image retrieval tasks in which the conventional ProtoPNets are difficult to utilize. To the best of our knowledge, this is the first work to construct a 'this looks like that' framework in image retrieval tasks. Our intention is for this work to motivate other researchers to develop inherently interpretable models in various computer vision tasks.
A limitation of ProtoMetric is that it requires a pre-trained black-box teacher model for the image retrieval tasks, which incurs a large computational cost. The teacher model is required to prevent overfitting while training ProtoMetric with a sufficient number of training iterations to bring prototypes sufficiently close to a certain image patch. How to prevent overfitting to the training dataset is an open question in deep metric learning, and further investigation is needed to mitigate this.