Case-based Similar Image Retrieval for Weakly Annotated Large Histopathological Images of Malignant Lymphoma Using Deep Metric Learning

In the present study, we propose a novel case-based similar image retrieval (SIR) method for hematoxylin and eosin (H&E)-stained histopathological images of malignant lymphoma. When a whole slide image (WSI) is used as an input query, it is desirable to be able to retrieve similar cases by focusing on image patches in pathologically important regions such as tumor cells. To address this problem, we employ attention-based multiple instance learning, which enables us to focus on tumor-specific regions when the similarity between cases is computed. Moreover, we employ contrastive distance metric learning to incorporate immunohistochemical (IHC) staining patterns as useful supervised information for defining appropriate similarity between heterogeneous malignant lymphoma cases. In the experiment with 249 malignant lymphoma patients, we confirmed that the proposed method exhibited higher evaluation measures than the baseline case-based SIR methods. Furthermore, the subjective evaluation by pathologists revealed that our similarity measure using IHC staining patterns is appropriate for representing the similarity of H&E-stained tissue images for malignant lymphoma.

an indicator known as the blue ratio, which is higher in regions with more cells, was used as the criterion for selecting informative patches. In [27], clustering was first applied to patches, and a representative patch from each cluster was considered as an informative patch. Whereas these approaches use simple methods for selecting informative patches, we employ attention-based MIL so that informative patches can be selected from the tumor region. For pathological image classification based on WSIs that contain normal and tumor cells, MIL has been demonstrated to be effective [32,24,9,7,48,3,43,19]. Attention-based MIL [19] is particularly useful because it can quantify the relative importance of each image patch in the WSI as an attention weight. Previous studies in digital pathology have shown that high-attention regions in attention-based MIL correspond to subtype-specific regions and are often regarded as tumor regions because the information that determines subtype classification would exist in tumor cells [19,52,30,20].
Indeed, attention-based MIL has been applied to the subtype classification of malignant lymphoma, where it was shown that the computed attention regions corresponded to positive cells in the IHC stained tissue specimen [19]. In this study, we incorporate attention-based MIL into case-based SIR tasks, which enables the learning of distance measures based only on the information in the tumor region.
In SIR, DML has been effectively used to obtain an appropriate distance metric [47,17,36,22,41,51]. DML methods can be roughly categorized into parametric distance measure-based and feature learning-based approaches. The former approach includes Mahalanobis DML [49,40,4] and multiple kernel learning [42,38,14,15]. When the SIR method is implemented with a deep neural network (DNN) model, a feature learning-based approach is often adopted.
For instance, given a class label for each image, feature learning is performed based on a loss function such that images with the same label are closer together, whereas images with different labels are farther apart [41,22,36]. By selecting images based on the distance in the learned feature space, the distance metric in the SIR method can properly consider the class labels. The proposed SIR method is designed to retrieve similar cases with similar IHC staining patterns.
As mentioned, we regard the similarity of IHC staining patterns as the continuous similarity between two cases.
Metric learning algorithms that use a continuous label or multi-label have been proposed [26,18]. We also define the continuous relevance index between two cases using IHC staining patterns and perform metric learning that utilizes pathological images.
Our contributions
In this study, we propose a case-based SIR method that supports malignant lymphoma pathology diagnosis by introducing a DNN model that effectively incorporates MIL and DML. To the best of our knowledge, no existing method combines these techniques for case-based SIR. Overall, the main advantages of the proposed method and our contributions in this study are summarized as follows:
• By incorporating attention-based MIL into case-based SIR tasks, the proposed method can retrieve similar cases based on a similarity measure that depends only on patches in the tumor region.
• By defining the similarity of two H&E stained images using IHC staining patterns in contrastive DML, the proposed method can retrieve similar cases that would have similar IHC staining patterns.
• We applied the proposed method to 249 cases of malignant lymphoma and demonstrated its effectiveness through quantitative evaluation and subjective evaluation by 10 pathologists.

Problem setup
In this study, we denote the set of natural numbers up to $c$ as $[c] := \{1, \ldots, c\}$. Let $N$ be the number of past malignant lymphoma cases (patients), $K$ be the number of subtypes, and $L$ be the number of kinds of IHC stains. The entire database of the past cases is represented as $T := \{(X_n, Y_n, S_n)\}_{n \in [N]}$, where $X_n$ is the WSI of case $n$, $Y_n$ is a $K$-dimensional one-hot vector for the subtype, and $S_n$ is an $L$-dimensional binary vector for IHC staining patterns. Here, the position of 1 in the one-hot vector $Y_n$ indicates the subtype, whereas the values 1 and 0 in the binary vector $S_n$ indicate whether or not the corresponding IHC stain was used for the pathological diagnosis.
We develop a case-based SIR method using a database of past cases $T$, as shown in Fig. 2. Let $I_n^{(\mathrm{HA})}$ and $I_m^{(\mathrm{HA})}$ denote the sets of image patches in the tumor regions in $X_n$ and $X_m$, respectively, where "HA" denotes "High-Attention" and we refer to those patches as HA patches. We denote the image patches from cases $n$ and $m$ as $x_{n,i}$, $i \in I_n$, and $x_{m,j}$, $j \in I_m$, respectively. Furthermore, let us denote the feature vectors (which will be learned through DNN representation learning, whose details will be explained in Section 3) corresponding to image patches $x_{n,i}$ and $x_{m,j}$ as $z_{n,i}$ and $z_{m,j}$, respectively. The desirable distance metric $D(n, m)$ between query case $m$ and past case $n$ is defined in Eq. (1) using the feature vectors of the HA patches of the two cases.
Given a query WSI $X_m$, the proposed case-based SIR method retrieves one or a few similar cases $n$ whose distances $D(n, m)$ are smaller than those of the remaining cases.³
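The retrieval step described above, ranking past cases by their distance to the query, can be illustrated with a minimal sketch. The exact form of Eq. (1) is not reproduced here, so the case distance below (the smallest Euclidean distance over all pairs of HA patch features) is an assumption made only for illustration, and the names `case_distance` and `retrieve` and the toy feature arrays are hypothetical.

```python
import numpy as np

def case_distance(z_query, z_case):
    """Smallest Euclidean distance between any query patch and any case patch.

    ASSUMPTION: taking the minimum over all patch pairs is an illustrative
    aggregation; it is not claimed to be the exact form of Eq. (1).
    z_query: (num_query_patches, dim), z_case: (num_case_patches, dim).
    """
    diff = z_query[:, None, :] - z_case[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).min())

def retrieve(z_query, database, top_k=3):
    """Return indices and distances of the top_k closest database cases."""
    dists = np.array([case_distance(z_query, z_case) for z_case in database])
    order = np.argsort(dists, kind="stable")[:top_k]
    return order.tolist(), dists[order].tolist()

# Toy data: each case is a set of 2-D patch features (hypothetical values).
z_query = np.zeros((2, 2))
database = [np.full((3, 2), 5.0), np.full((2, 2), 1.0), np.full((4, 2), 3.0)]
ranked, dists = retrieve(z_query, database, top_k=3)
print(ranked, dists)
```

Any other aggregation over patch pairs could be substituted in `case_distance` without changing the ranking logic in `retrieve`.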
There are two challenges for learning the desirable distance metric in Eq. (1). The first challenge is that the sets of patches in the tumor region, $I_n^{(\mathrm{HA})}$ and $I_m^{(\mathrm{HA})}$, are unknown. As mentioned in Section 1, to overcome this challenge, we employ attention-based MIL, the details of which will be described later. The second challenge is learning the features $z_{n,i}$ and $z_{m,j}$ as a DNN representation such that the distances in Eq. (1) tend to be small when the IHC staining patterns $S_n$ and $S_m$ are similar. As mentioned in Section 1, to overcome this challenge, we employ contrastive DML, the details of which will be described later. In Section 3, we propose a DNN model and its learning algorithm by effectively combining attention-based MIL and contrastive DML for IHC staining patterns.

Evaluation measure
In the context of DML for classification problems, a common evaluation measure is simply the classification error (the subtype classification error in our problem setup). However, the definitive diagnosis of lymphoma is determined by observing the IHC stained tissues and we consider that the subtype classification error is not sufficient as the evaluation measure for the similarity of H&E stained tissue slides. Because H&E stained tissues of malignant lymphoma are highly heterogeneous, even cases with the same definitive subtype label have quite different IHC staining patterns.
Thus, the similarity of appearances of H&E stained tissue specimens is reflected in the relevance of IHC staining patterns rather than the subtypes. As a quantitative performance measure of the proposed case-based SIR method, we thus employ the Jaccard index for IHC staining patterns. Given a query case $m$ and a retrieved case $n$, the Jaccard index of their IHC staining patterns $S_m$ and $S_n$ is defined as follows:
$$J(S_m, S_n) := \frac{\sum_{l \in [L]} \min(S_{m,l}, S_{n,l})}{\sum_{l \in [L]} \max(S_{m,l}, S_{n,l})}, \qquad (2)$$
that is, the number of IHC stains used in both cases divided by the number of IHC stains used in at least one of the two cases. For example, we consider the case wherein we use six types of IHC stains, CD20, CD30, CD79a, bcl2, bcl6, and MUM1 (in this example, $L = 6$), and the elements of the multi-label vector $S_n$ correspond to [CD20, CD30, CD79a, bcl2, bcl6, MUM1].
We verified that the similarity of IHC staining patterns in the form of Eq. (2) is a more meaningful measure than subtype classification error for case-based SIR in practical malignant lymphoma pathology by conducting subjective evaluation experiments with 10 pathologists. In the subjective evaluation experiments, given a query case, we present a pair of retrieved similar cases, one of which is selected based on a distance measure trained to comply with the IHC staining patterns, whereas the other is selected using a distance measure trained to minimize subtype classification error, and pathologists answer which of the two cases is more similar to the query case. The details of the subjective evaluations are presented in Section 4.
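As a small worked illustration of the Jaccard index for IHC staining patterns, the sketch below computes it for two binary vectors over the six stains named above. The two staining patterns are hypothetical values chosen for the example, and returning 0 when neither case uses any stain is a convention assumed here.

```python
import numpy as np

def jaccard_index(s_m, s_n):
    """Jaccard index of two binary IHC staining vectors of equal length."""
    s_m = np.asarray(s_m, dtype=bool)
    s_n = np.asarray(s_n, dtype=bool)
    union = np.logical_or(s_m, s_n).sum()
    if union == 0:
        return 0.0  # assumed convention when no stain is used in either case
    return float(np.logical_and(s_m, s_n).sum() / union)

# Hypothetical patterns over [CD20, CD30, CD79a, bcl2, bcl6, MUM1].
s_query = [1, 0, 1, 1, 0, 0]      # CD20, CD79a, bcl2
s_retrieved = [1, 0, 0, 1, 1, 0]  # CD20, bcl2, bcl6
score = jaccard_index(s_query, s_retrieved)
print(score)  # 2 stains shared / 4 stains used in either case = 0.5
```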

Attention-based MIL
Here, we describe the basic idea of attention-based MIL [24], which is used as a component of the proposed DNN model in the next section, and its use in the proposed case-based SIR method. In attention-based MIL, we define a bag as a set of image patches randomly sampled from a WSI. The basic idea of attention-based MIL is to assign an attention weight to each image patch, which indicates the relative importance of each image patch within the bag. Let $B_n$ be the set of bags in the case $n \in [N]$ and $J_{n,b}$ be the set of image patches in the bag $b$ of case $n$. Furthermore, we denote $a_{n,b,i}$ as the attention weight of the image patch $i \in J_{n,b}$. The attention weight $a_{n,b,i}$ takes a value in $[0, 1]$, and it is normalized such that the sum of the attention weights in each bag is one, that is, $\sum_{i \in J_{n,b}} a_{n,b,i} = 1$.

Contrastive DML
Here, we describe the basic idea of contrastive DML [5], which is used as a component of the proposed DNN model in the next section, and its use in the proposed case-based SIR method. The goal of conventional feature learning-based DML is to learn a function that maps an image patch $x_{n,i}$ to a feature vector $z_{n,i}$ for $n \in [N]$, $i \in I_n$ such that the Euclidean distance between the features $z_{n,i}$ and $z_{m,j}$ is small if the cases $n$ and $m$ belong to the same class.
As discussed, because we intend to retrieve similar cases that would have similar IHC staining patterns rather than just belonging to the same subtype, we need to incorporate the similarity of IHC staining patterns into the distance metric. Let $d(x_{n,i}, x_{m,j}) := \| z_{n,i} - z_{m,j} \|_2$, and consider the problem of learning the distance function $d(\cdot, \cdot)$ to minimize the contrastive loss function in Eq. (3), where $G$ is a hyperparameter that defines a margin between dissimilar image patches (see Eq. (8) in Section 3.1 for the concrete formulation of the distance function $d(\cdot, \cdot)$). In Eq. (3), $r(n, m)$ is known as the relevance index in the context of contrastive DML, and we employ the Jaccard index in Eq. (2) as the relevance index for our task. The first term in Eq. (3) pulls the image features of two inputs with similar labels closer together, whereas the second term pushes the image features of two inputs with different labels farther apart based on the margin $G$. By learning a feature extraction function that minimizes the loss in Eq. (3), we can obtain a distance function $d(\cdot, \cdot)$ that incorporates the similarity of the IHC staining patterns.
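To make the roles of the two terms concrete, the following sketch evaluates a contrastive loss with a continuous relevance index for a single pair of feature vectors. It assumes a commonly used squared form, $r \, d^2 + (1 - r) \max(0, G - d)^2$; the exponents and any scaling constants in Eq. (3) may differ, and the function name `contrastive_loss` and the toy vectors are hypothetical.

```python
import numpy as np

def contrastive_loss(z_a, z_b, relevance, margin):
    """Contrastive loss for one pair with a relevance index in [0, 1].

    ASSUMPTION: squared pull and push terms (a common contrastive form);
    this is a sketch, not necessarily identical to Eq. (3).
    """
    d = float(np.linalg.norm(np.asarray(z_a) - np.asarray(z_b)))
    pull = relevance * d ** 2                              # similar pairs: shrink d
    push = (1.0 - relevance) * max(0.0, margin - d) ** 2   # dissimilar: push past margin
    return pull + push

z_a = np.array([0.0, 0.0])
z_b = np.array([3.0, 4.0])  # Euclidean distance 5
loss_similar = contrastive_loss(z_a, z_b, relevance=1.0, margin=7.0)
loss_dissimilar = contrastive_loss(z_a, z_b, relevance=0.0, margin=7.0)
loss_partial = contrastive_loss(z_a, z_b, relevance=0.5, margin=7.0)
print(loss_similar, loss_dissimilar, loss_partial)
```

With a relevance index such as the Jaccard index, a pair with partially overlapping staining patterns contributes to both terms in proportion to its relevance, rather than being treated as purely similar or purely dissimilar.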

Proposed case-based SIR method
We propose a DNN model and its learning algorithm that provide the desirable distance metric in Eq. (1) for our case-based SIR task by effectively combining attention-based MIL and contrastive DML. The problem of learning the desirable distance metric in Eq. (1) is decomposed into two sub-problems. The first sub-problem is to learn a function for extracting a set of HA image patches $I_n^{(\mathrm{HA})}$ from each case $n \in [N]$. The second sub-problem is to learn a function that maps an image patch $x_{n,i}$ into a feature vector $z_{n,i}$ for $n \in [N]$, $i \in I_n^{(\mathrm{HA})}$, the latter of which is used to measure the distance in Eq. (1). Each of these two functions is obtained as a part of the entire DNN model.
³ That is, the case $n$ that has the minimum (resp., the 2nd, 3rd, ... minimum) distance from a query case $m$ is retrieved as the most similar (resp., the 2nd, 3rd, ... most similar) case.
Figure 3 illustrates the entire DNN model that consists of four components, $f_{\mathrm{enc}}$, $f_{\mathrm{att}}$, $f_{\mathrm{clf}}$, and $f_{\mathrm{met}}$, which are parameterized by sets of learnable parameters $\theta_{\mathrm{enc}}$, $\theta_{\mathrm{att}}$, $\theta_{\mathrm{clf}}$, and $\theta_{\mathrm{met}}$, respectively. Each component is described as follows:

DNN model
• Feature extractor $f_{\mathrm{enc}}$: The first component $f_{\mathrm{enc}}$ is known as the feature extractor, which is introduced such that the two aforementioned sub-problems have common shared features. The feature extractor is a mapping as follows:
$$h_{n,i} = f_{\mathrm{enc}}(x_{n,i}; \theta_{\mathrm{enc}}), \qquad (4)$$
where $h_{n,i}$ denotes a feature vector of the image patch $x_{n,i}$, which is implicitly defined by learning the representation in the DNN model.
• Attention network $f_{\mathrm{att}}$: The second component $f_{\mathrm{att}}$ is used to compute the attention weights $a_{n,b,i}$, $n \in [N]$, $b \in B_n$, $i \in J_{n,b} \subset I_n$, and it is formally expressed as follows:
$$\{a_{n,b,i}\}_{i \in J_{n,b}} = f_{\mathrm{att}}(\{h_{n,i}\}_{i \in J_{n,b}}; \theta_{\mathrm{att}}). \qquad (5)$$
Specifically, the attention weight is computed as follows:
$$a_{n,b,i} = \frac{\exp\{w^\top \tanh(V h_{n,i})\}}{\sum_{j \in J_{n,b}} \exp\{w^\top \tanh(V h_{n,j})\}}, \qquad (6)$$
where $V$ denotes a matrix of parameters, and $w$ denotes a vector of parameters with appropriate dimensions, that is, $\theta_{\mathrm{att}} := (V, w)$.
• Classifier network $f_{\mathrm{clf}}$: The third component $f_{\mathrm{clf}}$ is used to classify the malignant lymphoma subtype based on the MIL framework (see Section 2.4). In MIL, a bag (a set of image patches randomly sampled from a WSI) is classified into one of the $K$ subtypes. The input of $f_{\mathrm{clf}}$ is the feature vector weighted by the attention weights as follows:
$$u_{n,b} = \sum_{i \in J_{n,b}} a_{n,b,i} h_{n,i}. \qquad (7)$$
Given an input $u_{n,b}$, the subtype classifier outputs the $K$-dimensional class probability vector $P(\hat{Y}_{n,b})$. Note that constructing a subtype classifier is not the main purpose of this study. By training the subtype classifier in the MIL framework, the attention network $f_{\mathrm{att}}$ is trained such that image patches taken from the tumor region have large attention weights.
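• Metric network $f_{\mathrm{met}}$: The fourth component $f_{\mathrm{met}}$ transforms the feature vector $h_{n,i}$ into the feature vector $z_{n,i} = f_{\mathrm{met}}(h_{n,i}; \theta_{\mathrm{met}})$ used for the distance metric. The distance between two image patches is defined using $f_{\mathrm{enc}}$ and $f_{\mathrm{met}}$ as follows:
$$d(x_{n,i}, x_{m,j}) := \| f_{\mathrm{met}}(f_{\mathrm{enc}}(x_{n,i}; \theta_{\mathrm{enc}}); \theta_{\mathrm{met}}) - f_{\mathrm{met}}(f_{\mathrm{enc}}(x_{m,j}; \theta_{\mathrm{enc}}); \theta_{\mathrm{met}}) \|_2. \qquad (8)$$
Note that when $f_{\mathrm{met}}$ is trained, only parts of the image patches, namely the HA image patches $\{I_n^{(\mathrm{HA})}\}_{n \in [N]}$, are used.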
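The attention weighting performed by the attention network and the bag-level input to the classifier network described in the list above can be sketched as follows. This is a minimal NumPy illustration that assumes the tanh-based softmax attention commonly used in attention-based MIL; the parameters `V` and `w` and the patch features `h` are random placeholders rather than trained values, and the function names are hypothetical.

```python
import numpy as np

def attention_weights(h, V, w):
    """Softmax attention over the patches of one bag.

    h: (num_patches, feat_dim) patch features from the feature extractor.
    V: (hidden_dim, feat_dim) and w: (hidden_dim,) attention parameters.
    """
    scores = np.tanh(h @ V.T) @ w   # one scalar score per patch
    scores = scores - scores.max()  # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()              # non-negative, sums to one

def bag_feature(h, a):
    """Attention-weighted sum of patch features (input to the classifier)."""
    return a @ h

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))  # 5 patches, 8-D features (placeholder values)
V = rng.normal(size=(4, 8))
w = rng.normal(size=4)
a = attention_weights(h, V, w)
u = bag_feature(h, a)
print(a.sum(), u.shape)
```

Patches with large weights in `a` dominate the bag feature `u`, which is why training the bag classifier encourages high attention on patches that are informative for the subtype.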

Training DNN model
The parameters $\theta_{\mathrm{enc}}$, $\theta_{\mathrm{att}}$, $\theta_{\mathrm{clf}}$, and $\theta_{\mathrm{met}}$ of the four components $f_{\mathrm{enc}}$, $f_{\mathrm{att}}$, $f_{\mathrm{clf}}$, and $f_{\mathrm{met}}$ are optimized by alternately solving two minimization problems, one for the loss function $L_c$ and the other for the loss function $L_d$. Here, $L_c$ is the standard cross-entropy loss for subtype classification, whereas $L_d$ is the contrastive loss function (see Eq. (3)).
In Fig. 3, attention sampling extracts the image patches that have the top-$M$ highest attention weights in a bag during the training of the MIL classification model. Pair sampling is a process in training the DML model in which a pair of image patches is randomly selected from the HA image patches extracted by attention sampling. The contrastive DML is trained to minimize the loss function $L_d$ using the calculated distance between the two HA image patches selected by pair sampling and the relevance of their IHC staining patterns.
In our implementation, the four components are implemented as follows. We employ ResNet50 [21] as the feature extractor $f_{\mathrm{enc}}$, and it is initialized with the extractor pre-trained on the ImageNet database [10]. The attention network $f_{\mathrm{att}}$ is implemented as a softmax operator in Eq. (6). The classifier network $f_{\mathrm{clf}}$ is implemented using a simple neural network for $K$-class classification. The metric network $f_{\mathrm{met}}$ is implemented using a simple neural network for feature transformation. The implementation details are described in Section 4.1.
[…] of similar image patch pairs is also provided as additional information (see Section 2.2 and Fig. 1).
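The attention sampling and pair sampling steps described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the attention weights are given as a plain array for one bag, a pair consists of one HA patch index from each of two cases, and the function names `attention_sampling` and `pair_sampling` are hypothetical.

```python
import numpy as np

def attention_sampling(attention, top_m):
    """Indices of the top_m patches with the highest attention in a bag."""
    order = np.argsort(-np.asarray(attention), kind="stable")
    return order[:top_m]

def pair_sampling(ha_n, ha_m, rng):
    """Randomly draw one HA patch index from each of two cases."""
    return int(rng.choice(ha_n)), int(rng.choice(ha_m))

# Toy attention weights for a bag of five patches (hypothetical values).
attention = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
ha = attention_sampling(attention, top_m=2)  # selects patches 1 and 3
rng = np.random.default_rng(0)
i, j = pair_sampling(ha, ha, rng)
print(ha.tolist(), i, j)
```

In training, the distance between the features of the sampled pair and the relevance index of the two cases would then be passed to the contrastive loss.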

Case-based SIR based on the trained DNN model
Multi-scale input
Because pathologists observe the H&E stained tissue slides under a microscope at different magnifications, it is preferred that the retrieval results are also based on similarity using multi-scale information. In the training phase using multi-scale inputs, different DNN models are independently trained with image patches of the corresponding magnifications. In the testing (retrieval) phase using two magnifications (e.g., 40x and 5x), the distance between image patches of high and low magnifications is calculated in a manner similar to Eq. (8). The embedded image features z […]
[…] low-magnification image input, and multi-scale input of high- and low-magnification images as the magnification of the input image patches (see Fig. 1). We chose 40x/20x and 5x as the high and low magnifications because they are the magnifications commonly used by pathologists; 40x/20x is used to examine the detailed shape of tumor cells, and 5x is used to understand the overall histology of the specimen. In the multi-scale setting, both high- and low-magnification image patches were extracted from the same regions, i.e., the central part of a low-magnification image patch covered the same region as the entire corresponding high-magnification image patch. In our experimental setting, a multi-scale bag comprised 200 image patches extracted from 100 positions.
Here, "all patches" represents that attention-based MIL was not employed and image patches were randomly selected, whereas "HA patches" represents that attention-based MIL was used for selecting HA image patches. In "all patches" setting, 1000 image patches were first randomly extracted from each WSI, and the training of contrastive DML was conducted for 100 epochs in which 100 image patches were randomly extracted from the 1000 image patches. The first method "pre-trained ResNet50 + all patches" is a simple baseline in which neither attention-based MIL nor contrastive DML was used, and the distance between two cases was simply measured by the distances between two feature vectors obtained by a pre-trained ResNet50 with the ImageNet database without any fine-tuning. Furthermore, "subtype-based metric" indicates that the relevance index r(m, n) in Eq. (3) for contrastive DML was defined as 1 if the subtypes of the two cases are the same and 0 otherwise, whereas "staining-based metric" indicates that the Jaccard index of IHC staining patterns in Eq. (2) was used as the relevance index. By comparing the proposed method with the first four baseline methods, we demonstrate the effect of selecting HA image patches through attention-based MIL and the effect of considering the similarity of IHC staining patterns through contrastive DML.

Results
One of the main contributions of this study is the utilization of IHC staining patterns to provide a useful similarity measure for heterogeneous malignant lymphoma cases. The performance of the proposed and baseline methods was evaluated not only through a quantitative evaluation but also through a subjective evaluation by 10 pathologists. First, in the quantitative evaluation, the similarity of IHC staining patterns between a test query case and a retrieved similar case was compared among the methods. In the subjective evaluation, we examined whether IHC staining similarity is a more appropriate measure than subtype similarity for the pathological diagnosis of malignant lymphoma.
We also compared the similarity of malignant lymphoma subtypes between an input query case and a retrieved similar case in the form of subtype accuracy, which takes 1 if the two subtypes are the same and 0 otherwise. Table 2 summarizes the subtype accuracy results in the same format as Table 1. Although the criterion employed in the proposed method is not directly related to the subtype accuracy measure, the proposed method achieved the best performance among the five methods in two of the three magnification settings. In the magnification setting of 40x, "subtype-based metric + HA patches" achieved the best performance. This is reasonable because the subtype-based metric is directly tailored to subtype accuracy. As for why the proposed method with the "staining-based metric" was better than or comparable to the method with the "subtype-based metric," we conjecture that a good representation […]
[…] The differences between the proposed method and all the baseline methods are statistically significant at the 0.05 level in all magnification settings. Note that the upper bound accuracy of IHC staining patterns was 0.831 in this problem setting.

Method                                          Magnifications
Pre-trained ResNet50 + all patches              0.540±0.008   0.529±0.008   0.546±0.009
Subtype-based metric + HA patches               0.519±0.008   0.530±0.008   0.531±0.009
Staining-based metric + HA patches (proposed)   0.561±0.009   0.543±0.009   0.565±0.009

External validation
As an external validation, we conducted a retrieval experiment in which the 208 lymphoma cases in the Nagoya dataset were used as query cases. In this experiment, all 249 cases in the Kurume dataset were used as the training data and the search database. Because the original WSIs of the Nagoya dataset were scanned at 20x magnification, 20x image patches were likewise extracted from the WSIs of the Kurume dataset. In the experiment, the accuracies of IHC staining patterns obtained by the pre-trained ResNet50, the subtype-based metric, and the staining-based metric (proposed) were compared, and the results are summarized in Table 3. In the table, the average IHC staining accuracies of the top-5 retrieved similar cases are listed in the same manner as the results in Table 1. The results demonstrate that the proposed method has higher IHC staining accuracy even when the institutions where the tissue slides were prepared differ between the query cases and the search database. The accuracy was lower than in the previous experiment, which we attribute to differences in the characteristics of the datasets. The IHC staining patterns in the Nagoya dataset include IHC stains that were selected by general pathologists, whereas the IHC staining patterns in the Kurume dataset were selected only by expert hematopathologists. Unlike the Kurume dataset, the Nagoya dataset can have redundant IHC stains in the medical records, and the upper bound accuracy of IHC staining patterns of the Nagoya dataset was computed as 0.831, whereas that of the Kurume dataset was 0.948. Other causes of reduced accuracy are the differences in the staining condition of H&E stained tissues and in the scanning hardware, as discussed in many studies in digital pathology.
We did not address these factors because the Kurume dataset is uniform; however, many methods have been proposed to handle them, such as stain normalization [53,31] and domain adversarial learning [29,19]. Multi-label classification is inherently a difficult problem setting in which to achieve high accuracy. Nevertheless, the accuracies of the SIR methods in the external validation are still low, and further improvement is required for clinical application, which is left as future work.
The above experiments demonstrated an improvement in the accuracy of IHC staining patterns. The following subjective experiment verifies that the similarity of IHC staining patterns is an appropriate similarity measure for the appearance of H&E stained tissue images.
Subjective evaluation
The goal of the subjective evaluation is to confirm whether IHC staining similarity is a more appropriate measure than subtype similarity for the pathological diagnosis of malignant lymphoma. To this end, we compared only the proposed method, "staining-based metric + HA patches," with one of the baseline methods, "subtype-based metric + HA patches," in the subjective evaluations. The task of each participant (pathologist) was to evaluate which of the two retrieval results (obtained using the proposed and baseline methods, respectively) was more similar to an input query. An example of the subjective evaluation task is shown in Fig. 4, where the pairs of image patches had the minimum distance $d^{(H,L)}$ in Eq. (13). Each of the 249 cases was used as an input query once; that is, for each input query, the task was to find similar cases from the training (+validation) set of the cross-validation round in which the input query was in the test set. The result for each query was evaluated by a 4-grade score; a participant was asked to select one of the following options: "the result 1 is similar to a query," "the result 1 is weakly similar to a query," "the result 2 is weakly similar to a query," or "the result 2 is similar to a query," where the assignment of the proposed and baseline methods to result 1 and result 2 was determined at random and blinded. In the experiment, all participants were asked to evaluate each case considering the similarity of both 40x and 5x image patches. Note that this is a relative evaluation in which the similarity of result 1 indicates the dissimilarity of result 2. The order of query cases was also shuffled randomly for each participant.
In total, 10 pathologists composed of three experienced hematopathologists, three standard pathologists, and four pathological trainees participated in the subjective evaluation. Figure 5 shows the results of each of the 10 participants in pie charts. For all 10 participants, the proportion of responses in which the proposed method was more similar (thick blue) or weakly similar (thin blue) to the query case than the baseline method was significantly higher than the opposite responses (thick and thin orange colors). This result indicates that all 10 pathologists determined that IHC staining similarity was more appropriate than subtype similarity as a similarity measure for the pathological diagnosis of malignant lymphoma.
To aggregate the evaluation results, the evaluation score was counted as 1 if the result of the proposed method was evaluated as "similar" or "weakly similar," and 0 otherwise. We further compared "confident responses" by removing "weakly similar" responses. Table 4 lists the average evaluation scores of each of the 10 participants. The results demonstrate that the proposed method retrieved cases that pathologists judged to be more similar to the query cases. The superiority of the proposed method is more evident when we consider only the confident responses. In all the presented results, the difference between the proposed and baseline methods is statistically significant with p < 0.05 based on a randomized test.⁷ These results on subjective evaluation demonstrate that IHC staining similarity is more appropriate than subtype similarity for the pathological diagnosis of malignant lymphoma.
Figure 4: Example of the subjective evaluation tasks. The participants were asked to evaluate the result that was more similar to a query case by a 4-grade score. Five image patches of 40x and 5x magnifications that had the top-5 attention weights were shown for an input query case, whereas the image patches that had the minimum distance from each query image patch obtained using the two methods are shown. Either "Retrieval result 1" or "Retrieval result 2" corresponds to either the proposed method or the baseline method, which is determined at random.
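The randomized test on the binary evaluation scores can be illustrated with a short Monte Carlo sketch. The exact randomization scheme is not spelled out in the text, so the sketch assumes that, under the null hypothesis that the two methods are the same, each response favors the proposed method with probability 0.5. The response vector is hypothetical, and 100,000 simulations are used here only to keep the example fast.

```python
import numpy as np

def randomized_test(scores, n_sim=100_000, seed=0):
    """One-sided Monte Carlo p-value for mean(scores) exceeding chance (0.5).

    ASSUMPTION: under the null, each binary score is an independent fair
    coin flip; this is an illustrative scheme, not a confirmed protocol.
    """
    scores = np.asarray(scores)
    observed = scores.mean()
    rng = np.random.default_rng(seed)
    # Number of responses favoring the proposed method under the null.
    sim_means = rng.binomial(len(scores), 0.5, size=n_sim) / len(scores)
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (np.sum(sim_means >= observed) + 1) / (n_sim + 1)
    return float(observed), float(p_value)

# Hypothetical responses for 249 queries: 1 = proposed method judged more similar.
scores = np.array([1] * 150 + [0] * 99)
observed, p_value = randomized_test(scores)
print(observed, p_value)
```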

Visualization of attention regions
In case-based SIR, it is desirable to be able to explain why the retrieved cases were selected as similar. To realize such explainable retrieval results, our proposed method provides the attention weights that indicate the regions that were focused on as HA image patches in computing the case distance $D(n, m)$. The color plots of all WSIs in Fig. 1 show the attention weights. For visualization purposes, the computed attention weights are shown as a heat map. We observe that the selected patches are visually similar; in particular, they are quite similar in both low and high magnifications by considering the multi-scale input.
⁷ The randomized test was based on the null hypothesis that the proposed and baseline methods are the same. Specifically, we generated 1,000,000 randomized results based on the null hypothesis. A p-value for each participant is listed in Table 4. Consequently, most of the 1,000,000 results are less than the actual scores listed in Table 4, and all the scores listed in Table 4 are statistically significantly larger than 0.5, with p < 0.05.
Figure 5: Pie charts for the proportions of the four answers by 10 participants (legend: Proposed method ++, Proposed method +, Baseline method +, Baseline method ++). Each chart corresponds to a different participant, where "++" and "+" in the legends mean "similar to a query" and "weakly similar to a query," respectively. It can be confirmed that the thick blue and thin blue areas together clearly account for more than half in all participants, which indicates that the results retrieved using the proposed method were more likely to be evaluated as "more similar."

Examples
In the previously described experiments, we confirmed that the proposed method performed better than the baseline methods. Here, we investigate the results of the proposed method that were evaluated as more similar, and the difference between the results of the proposed method and those of the baseline method. Figure 6 shows a histogram of the cases according to how many of the 10 participants responded that the proposed method is better than the baseline method in the subjective evaluation in Section 4.2. The horizontal axis represents the number of participants who voted for the proposed method as the more similar result, e.g., "10" shows the number of cases in which all participants voted for the proposed method as "similar" or "weakly similar." In these aggregated results, the results of the proposed method were evaluated as more suitable than those of the baseline method by the majority of the participants in 143 cases. In total, in 42 cases, the proposed method was evaluated as more similar by all 10 participants, whereas there were only nine cases in which the baseline method was evaluated as more similar by all 10 participants. Figure 7 shows examples of retrieval results where all participants evaluated the proposed method as more similar than the baseline method. In addition to the same image patches as shown in the subjective evaluation, the thumbnails of the retrieved similar cases are also shown to make it easy to confirm whether the two retrieved cases are the same.
In the examples, the lower images show that both the proposed and baseline methods retrieved the same similar case (but different image patches). Even if both methods retrieved the same similar case, the proposed method could obtain more similar image patches and obtain a better evaluation by all 10 pathologists.
Table 4: Mean binary scores of all participants through the subjective evaluation experiment. The "Score" indicates the results for all cases, whereas the "Confident Score" indicates the results answered with confidence, i.e., by excluding "weakly similar" responses. Each evaluation score was counted as 1 if the result of the proposed method was evaluated as "similar" or "weakly similar," and 0 otherwise. The bracketed numbers indicate the number of "confident responses" by removing "weakly similar" responses. The p-value for each result was computed by the Monte Carlo statistical test with the null hypothesis that the proposed and baseline methods are the same, indicating that all the scores are highly statistically significant.
Mean±S.E.: Score 0.591±0.0069, Confident Score 0.661±0.023

Conclusion
We proposed a case-based SIR method for unannotated large histopathological images of malignant lymphoma. The proposed method with attention-based MIL can automatically extract informative image patches from unannotated WSIs, and it enables a user to input a WSI as a query without the selection of an image patch. Moreover, we employed the similarity of IHC staining patterns as the similarity measure in contrastive DML, where the embedded features of the images that have similar IHC staining patterns are much closer. In the quantitative evaluation of 249 malignant lymphoma patients, we compared the proposed method with several baseline methods, and our proposed method exhibited the highest accuracy in both IHC staining patterns and subtypes between query and similar cases. Furthermore, we conducted a subjective evaluation experiment to verify our proposed similarity measure using IHC staining patterns and confirmed that our method retrieved cases that pathologists judged to be more similar in the observation of the H&E stained tissue slides than those retrieved by the baseline method. The proposed case-based SIR method is useful in malignant lymphoma pathology because it provides not only WSIs but also image patches and visualized attention weights that indicate the similarity of the image patches between a query case and a retrieved similar case and the regions of the entire WSI that were focused on in the retrieval phase.
Figure 6: Histogram of the number of cases according to how many of the 10 participants responded that the proposed method is better than the baseline method. In total, in 42 cases, the proposed method was evaluated as more similar by all 10 participants, whereas there were only nine cases in which the baseline method was evaluated as more similar by all 10 participants.