Computers in Biology and Medicine

Introduction
In the supervised learning paradigm, deep learning methods have shown promising performance in a wide range of medical imaging applications. Nevertheless, these methods usually require large amounts of data for training, which must be labeled by expert clinicians. Obtaining these labeled datasets is a time-consuming process that is susceptible to inter-annotator variability, which complicates the use of these models in practice. This is the case for histology image analysis, where the large size of tissue images magnified on whole slide images (WSIs), the heterogeneity of patterns, and the high level of expertise required to annotate the data make this learning paradigm unfeasible. Considering these limitations, the most popular choice in this field has become the use of weakly supervised learning strategies under the multiple instance learning (MIL) paradigm. Typically, the training dataset is composed of bags (WSIs) that are known to have cancer or not. Each bag consists of instances (tissue tiles), whose labels are not accessible during training. Under this setting, different works have demonstrated outstanding results for WSI-level cancer classification. In contrast to the binary scenario, the multi-label MIL literature still remains scarce in histology image analysis [5].
Based on these observations, we propose a novel formulation for MIL in the multi-label scenario, applied to histological prostate cancer grading in WSIs. The key contributions of our work can be summarized as follows:
• A novel constrained formulation for instance-level MIL, which integrates an auxiliary term that forces the model to increase the number of instances classified into positive classes.
• In addition, our formulation leverages prior knowledge in terms of relative tissue proportions (i.e. the primary cancerous grade in the WSI) by imposing inequality constraints on bag (WSI)-level class proportions.

Multiple instance learning
In computer vision, multiple instance learning (MIL) is a learning paradigm that works with independent images (instances) that form groups (bags), where only bag-level information is known. In the multi-label scenario, each instance belongs to one class, but different classes may coincide at the bag level [6]. Modern MIL methods using convolutional neural networks (CNNs) for feature extraction usually process each instance independently, and then combine the instance-level information into one bag-level output. Methods that combine instance-level features are known as embedding-based, and require a subsequent classification layer. In contrast, instance-based architectures directly combine instance-level predictions into the bag classification. Beyond the basic mean and maximum aggregation functions, recent methods have proposed the use of weighted-averaged embeddings, using instance-specific attention weights learned via a multi-layer perceptron projection [7] or recurrent neural networks [1]. It is noteworthy that, although embedding-based approaches have yielded slightly better bag-level results in previous literature, they do not provide instance-level probability outputs. In this work, we are interested in both instance- and bag-level classification. Since we aim to include prior knowledge referring to class-wise proportions, our proposed method follows the instance-based learning paradigm.

Constrained classification
Constrained classification aims to guide the training of a CNN towards a solution that satisfies a given condition, taking advantage of knowledge additional to the main labels. This learning paradigm has gained popularity in weakly supervised scenarios (e.g. weakly supervised segmentation or MIL), since it allows incorporating local information into the global annotations. In a usual constrained weakly supervised setting, an additional loss term enforces the sum of the instance-level predictions to match a given proportion using an $L_2$ penalty [8]. Similarly, it has been applied in unsupervised anomaly segmentation, to force attention maps to focus on all patterns of training images [9], or in semi-supervised learning, to match the predicted size distributions to the ones observed in the supervised subset using a KL-divergence term [10]. While the aforementioned equality-constrained formulations proposed in weakly supervised settings are very promising, they demand exact knowledge of the prior. For instance, in the case of histological tumor grading, this would require knowing the extent of the cancerous tissue proportion. Therefore, recent works have preferred the use of inequality constraints to relax the prior assumptions, allowing more flexibility. This approach allows, for example, setting tolerance margins on target size using $L_2$ penalties [11,12] or Lagrangian optimization [13]. Following the example above, these works would require approximate knowledge of tumor size, and a tolerance margin would be applied to smooth the constraint. Unlike these works on weakly supervised classification, our formulation does not require prior information on the absolute size of the target. Instead, we seek to constrain the training to account for relative relationships between proportions within the same global image. In the case of histological whole slide image classification in a multi-label setting, this formulation incorporates information about which tumor grade is in the majority (primary) and which is in the minority (secondary), so that the proportion of the primary grade must be greater than that of the secondary grade. Thus, we use inequality constraints to (i) encourage the classification of instances into positive classes at the bag level, and (ii) incorporate relative relationships between class proportions within bags.
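To make the distinction between the two penalty families concrete, they can be sketched in a few lines of plain Python. This is a minimal illustration with made-up function names and arbitrary margin values, not the formulation used in the paper:

```python
# Sketch: equality vs. inequality constraint penalties on a predicted
# class proportion p (a scalar in [0, 1]). Names are illustrative.

def equality_penalty(p, target):
    """L2 penalty forcing p to match an exactly-known prior proportion."""
    return (p - target) ** 2

def inequality_penalty(p, lower, upper):
    """Relaxed L2 penalty: zero inside the tolerance interval
    [lower, upper], quadratic outside it."""
    if p < lower:
        return (lower - p) ** 2
    if p > upper:
        return (p - upper) ** 2
    return 0.0
```

The equality form penalizes any deviation from the prior, while the inequality form leaves a tolerance band untouched, which is why it is preferred when the prior is only approximately known.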

Methods
An overview of our proposed method is depicted in Fig. 1. In the following, we describe the problem formulation and each of the proposed components.

Problem formulation.
In the paradigm of Multiple Instance Learning (MIL), instances are grouped in bags $X = \{x_n\}_{n=1}^{N}$ that exhibit neither dependency nor ordering among them, and whose number of instances $N$ is arbitrary for each bag. In the multi-label scenario, there are multiple labels per bag, $Y = (Y_1, \dots, Y_k, \dots, Y_K)$, where $k \in \{1, \dots, K\}$ denotes each one of the $K$ categories. Also, individual labels $y_{n,k} \in \{0, 1\}$ exist for each instance in the bag, but they remain unknown during training. In the standard MIL formulation, a bag label is considered positive if at least one instance in the bag is positive for that category. We can rewrite this assumption in the following forms:

$$Y_k = \begin{cases} 1 & \text{if } \sum_{n} y_{n,k} > 0 \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

$$Y_k = \max_{n} \{y_{n,k}\} \quad (2)$$

Then, the optimization of the classifier $f_\theta$ is driven by the minimization of the cross-entropy loss between the reference and the predicted bag-level scores.
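As a toy illustration of the standard MIL assumption above, the bag label per category is simply the maximum over the instance labels (which are unknown at training time). The function below is our own sketch, not the paper's code:

```python
# Toy illustration of the standard MIL assumption: a bag is positive
# for class k iff at least one of its instances is positive for k.

def bag_labels(instance_labels):
    """instance_labels: list of per-instance multi-hot label lists,
    each of length K. Returns the K bag-level labels Y_k = max_n y_{n,k}."""
    K = len(instance_labels[0])
    return [max(y[k] for y in instance_labels) for k in range(K)]
```

For example, a bag whose instances carry labels `[0,1,0]`, `[0,0,0]` and `[0,0,1]` is positive for the second and third categories.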
¹ Based on the denomination proposed in [7].
Fig. 1. Method overview. In this work, we face weakly supervised histology image classification under the Multiple Instance Learning (MIL) paradigm. Each biopsy is a bag, while its patches are the instances that compose it. In the case of prostate analysis, expert labels are given by the Gleason score, which is the sum of the two most predominant tumor grades (i.e. G3, G4 or G5). In order to extract both instance- and bag-level labels, a standard instance-level MIL model with max aggregation is trained via the cross-entropy loss, $\mathcal{L}_{CE}$ (see Eq. (3)). Then, prior information is incorporated via inequality constraints that (i) force the classifier to predict instances of the classes that are present in the biopsy ($\mathcal{L}_{PE}$, see Eq. (5)), and (ii) ensure that the proportion of the primary grade is greater than that of the secondary grade ($\mathcal{L}_{PC}$, see Eq. (7)). Colored tissue indicates: blue: Gleason grade 4; red: Gleason grade 5. Circles in instance-level predictions indicate softmax scores, $\hat{y}_{n,k}$. The more intense the color, the higher the score.

Inequality constraints for MIL
Previous literature on instance-level MIL has proposed aggregation functions $f_{agg}(\cdot)$ based on the mean or maximum operator. The latter is used based on the direct interpretation of the maximum operation in the MIL formulation (Eq. (2)). Nevertheless, training a neural network via this aggregation produces well-known problems such as gradient vanishing for non-maximum instances. This limitation causes the network to focus only on discriminative instances during training, which leads to poor generalization on unseen samples. To alleviate this issue, we focus on the MIL formulation in Eq. (1), which interprets a positive bag via an inequality that forces the sum of instance scores to be greater than zero. In this line, we incorporate into the base instance-based MIL training a term that increases the proportion of instances classified as positive for a given class $k$, $p_k = \frac{1}{N} \sum_n \hat{y}_{n,k}$, by minimizing $-\log(p_k)$. Nevertheless, this log term becomes unbounded when $p_k \to 0$. To solve this limitation, we resort to a smooth, duality-gap-bound approximation. Concretely, we use the formulation proposed in [13] on constrained optimization, which models an inequality constraint $z \le 0$ using the log-barrier extension, formally defined as:

$$\tilde{\psi}_t(z) = \begin{cases} -\frac{1}{t} \log(-z) & \text{if } z \le -\frac{1}{t^2} \\ t z - \frac{1}{t} \log\left(\frac{1}{t^2}\right) + \frac{1}{t} & \text{otherwise} \end{cases} \quad (4)$$

where $t$ controls the barrier during training, and $z$ is the objective term. This log-barrier extension is applied to the proportion term $p_k$ of the bags that are positive for class $k$ at the bag level (i.e. $Y_k = 1$), taking $z = -p_k$ as the objective term in Eq. (4). Hereafter, we refer to this term as the positives expansion (PE) constraint.
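A plain-Python sketch of the log-barrier extension above, following the formulation cited from [13] (the function name is ours):

```python
import math

def log_barrier_ext(z, t):
    """Extended log-barrier of [13] for a constraint z <= 0.
    Strict log-barrier where z is safely negative; a linear extension
    keeps the penalty finite and differentiable as z approaches 0."""
    if z <= -1.0 / t ** 2:
        return -(1.0 / t) * math.log(-z)
    return t * z - (1.0 / t) * math.log(1.0 / t ** 2) + 1.0 / t
```

The two branches meet continuously at $z = -1/t^2$, and larger $t$ makes the barrier a tighter approximation of a hard constraint.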
Thus, we propose a MIL loss that combines the maximum formulation in Eq. (2), via the aggregation function $f_{agg}(\cdot) = \max_n \{\hat{y}_{n,k}\}$, and the PE term as follows:

$$\mathcal{L}_{PE} = \sum_{k : Y_k = 1} \tilde{\psi}_{t_{PE}}(-p_k) \quad (5)$$

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda_{PE} \, \mathcal{L}_{PE} \quad (6)$$

where $\lambda_{PE} \in \mathbb{R}^+$ weights the importance of each term during training. Note that the positives expansion term, $\mathcal{L}_{PE}$, is only applied for those categories that are positive at the bag level.

Incorporating proportion information
In some applications, prior knowledge about the bags is available. In this work, we focus on information usually recorded in medical domains: data regarding the proportion of categories in the image (i.e. primary and secondary tumor grades in the tissue). This information can be formulated as an inequality constraint between category proportions such that $p_{k'} > p_{k''}$, where $k'$ denotes the category with the larger proportion, and $k''$ its respective counterpart. Note that this relation can be established between any pair of positive categories in the bag for which this information is available. Thus, we contemplate an arbitrary number of conditions $M$ for each bag, which could give complete or partial information (i.e. the formulation could be applied to only a few known inequalities). For each condition $m$, both the major ($k'_m$) and minor ($k''_m$) categories should be indicated. Again, we make use of the extended log-barrier (see Eq. (4)) to solve this inequality constraint, which has demonstrated good performance when multiple constraints are used [13]. In this case, the objective term $z$ in Eq. (4) is the difference between the minor and major proportions in a given bag:

$$\mathcal{L}_{PC} = \sum_{m=1}^{M} \tilde{\psi}_{t_{PC}}(p_{k''_m} - p_{k'_m}) \quad (7)$$

Hereafter, we refer to this additional term as the proportion constraint (PC).
where $b$ indicates the bag index over the complete dataset, $\lambda_{PC} \in \mathbb{R}^+$ weights the relative importance of the proportion term during training, and $t_{PC}$ controls the barrier slope over time. It is noteworthy that the proportion term is not taken into account for bags with only one positive category, or for which the proportion information is unknown.
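Putting the two constraint terms together, a minimal sketch could look like the following. The variable names are ours; `props` holds the class-wise mean of instance scores, $p_k$, for one bag:

```python
import math

def barrier(z, t):
    """Extended log-barrier for a constraint z <= 0 (sketch of [13])."""
    if z <= -1.0 / t ** 2:
        return -(1.0 / t) * math.log(-z)
    return t * z - (1.0 / t) * math.log(1.0 / t ** 2) + 1.0 / t

def pe_term(props, positive_classes, t=15):
    """Positives expansion: push the proportion p_k above zero for
    every class that is positive at the bag level (constraint -p_k <= 0)."""
    return sum(barrier(-props[k], t) for k in positive_classes)

def pc_term(props, primary, secondary, t=5):
    """Proportion constraint: enforce p_primary > p_secondary
    (constraint p_secondary - p_primary <= 0)."""
    return barrier(props[secondary] - props[primary], t)
```

In a real training loop these penalties would be added to the cross-entropy loss with weights $\lambda_{PE}$ and $\lambda_{PC}$, and `props` would be differentiable tensors rather than plain floats.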
Taking into account the different terms previously detailed, $f_\theta$ is trained to solve the multi-label MIL formulation using the following optimization criterion via standard gradient descent:

$$\min_\theta \; \mathcal{L}_{CE} + \lambda_{PE} \, \mathcal{L}_{PE} + \lambda_{PC} \, \mathcal{L}_{PC} \quad (8)$$

Experimental setting
Datasets. In this work, we present a new dataset for prostate histological image analysis: SICAP-MIL.² This dataset is an extension of the previously published SICAP versions [14,15], expanded with 168 new WSIs. The dataset introduced is composed of 350 WSIs from 271 patients. The samples were digitized using a Ventana iScan Coreo scanner at 40× magnification. The slides were analyzed by a group of expert urogenital pathologists at Hospital Clínico of Valencia, and a combined Gleason score (GS) was assigned per biopsy. The Gleason score is the sum of the two main (primary and secondary) Gleason grades (GG) in the biopsy regarding their extent and severity. The clinical report specifies both the score and the primary and secondary grades that constitute it. SICAP-MIL is specifically designed to serve as a benchmark for MIL methods. Each WSI is considered as a bag, from which instances are obtained by tiling the images using non-overlapping moving windows of 512 × 512 pixels at the 10× resolution level. Note that tiles with less than 20% of tissue were excluded. The dataset is divided into three class-wise balanced groups for training, validation and testing. A summary of the dataset in terms of the labeled Gleason scores and proposed partitions is presented in Table 1. From the WSI-level Gleason scores, bag-level labels referring to the presence of each Gleason grade in the WSI are inferred. Also, the relative-proportion information of the primary and secondary grades is obtained from this score. We show in Fig. 2 the information regarding the primary and secondary Gleason grades for each WSI. It is observed that most cases present at least two tumor types, and thus two positives expansion (PE) constraints and one proportion constraint (PC) in the proposed formulation. Also, the difficulty of training a classifier capable of distinguishing between different Gleason grades in a weakly supervised manner is apparent, since a biopsy rarely presents a single tumor type. In addition, SICAP-MIL includes instance-level annotations, which allow testing the capability of MIL methods to leverage instance classifications in a weakly supervised manner. To do so, annotated WSIs are kept in the test subset. Note that instance-level labels are obtained from pixel-level annotations performed by expert pathologists. Non-cancerous patches are obtained only from benign WSIs, while cancerous patch-level labels are obtained by majority voting over segmentation masks. The distribution of the instance-level annotated subset from the test cohort is presented in Table 2.
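The tiling procedure described above can be sketched as follows. This is a simplified illustration: in practice the tissue mask would come from, e.g., a threshold on the slide background, which we abstract here as a caller-provided function:

```python
def tile_coordinates(width, height, tile=512):
    """Top-left (x, y) coordinates of non-overlapping sliding windows
    over a WSI plane of the given size (edge remainders are dropped)."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]

def filter_tiles(coords, tissue_fraction, min_tissue=0.20):
    """Keep tiles whose tissue fraction (computed by the caller, e.g.
    from a binary tissue mask) reaches the 20% threshold of the paper."""
    return [c for c in coords if tissue_fraction(c) >= min_tissue]
```

For instance, a 2048 × 1024 region yields 4 × 2 = 8 candidate tiles before tissue filtering.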
Implementation details. The proposed methods were trained using the train subset of SICAP-MIL. The backbone $f_\theta(\cdot)$ used was a VGG16 [16] pre-trained on ImageNet [17], which takes as input instances resized to 224 × 224 pixels. First, the PE setting was trained by empirically fixing $\lambda_{PE} = 0.1$ and $t_{PE} = 15$.
Instance-level student. In this work, we complement the proposed models for instance-level prediction with a second model, the Student, trained with instance-level hard pseudo-labels as described in [2]. This second stage has been shown to increase model performance without any modification of the architecture [2]. Note that we use as Teacher any instance-level classifier $f_\theta(\cdot)$ trained under the MIL paradigm with the proposed methodology. A Student model with the same complexity as the Teacher is trained following the Noisy Student paradigm of semi-supervised learning [18]. Concretely, a dropout rate of 0.20 is applied over the instance embedding, and data augmentation is applied to all instances using random rotations, translations, Gaussian blur and color jitter. The Student is trained for 60 epochs with mini-batches of 32 images using the SGD optimizer and a learning rate of $\eta = 1 \cdot 10^{-2}$.
Baselines. With the aim of comparing our approach to state-of-the-art methods, we implemented and tested prior MIL methodologies for both instance-level and bag-level classification on the SICAP-MIL dataset. Instance-based MIL. First, we compare our method with other instance-based MIL aggregations. Concretely, we use basic mean and max operations over the instance-level predictions to obtain the bag-level prediction. Embedding-based MIL. Secondly, we included embedding-based methods, which aim to obtain a bag-level embedding on which a classifier is trained to predict bag-level labels. Aggregation methods for instance-level features include mean, max, attention mechanisms, and recurrent neural networks (RNN). AttentionMIL [7] aims to obtain a weighted feature representation, which highlights positive instances in the bag. The weights are obtained using a multi-layer perceptron, and we implemented the gated attention mechanism, as detailed in [7]. Only the learning rate of the methods based on attention mechanisms was changed, to $\eta = 1 \cdot 10^{-3}$. Note that embedding-based methods do not make instance-level predictions, and are therefore only used as a comparison at the bag level. Although attention-based methods include instance-level importance weights, these are not true predictions at the instance level, as they are sensitive to the number of instances in the bag.
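For reference, the gated attention pooling of AttentionMIL can be sketched with NumPy as below. Shapes and parameter names are our assumptions for illustration; the actual baseline uses learned CNN features and trained projection matrices:

```python
import numpy as np

def gated_attention(H, V, U, w):
    """Gated attention pooling in the spirit of [7] (a sketch).
    H: (N, D) instance embeddings; V, U: (L, D) projections; w: (L,) scorer.
    Returns the attention-weighted bag embedding and the weights."""
    # a_n proportional to exp(w^T (tanh(V h_n) * sigmoid(U h_n)))
    gate = 1.0 / (1.0 + np.exp(-(H @ U.T)))        # (N, L) sigmoid gate
    scores = np.tanh(H @ V.T) * gate               # (N, L) gated features
    logits = scores @ w                            # (N,) attention logits
    a = np.exp(logits - logits.max())              # stable softmax over N
    a = a / a.sum()
    return a @ H, a
```

The weights sum to one over the bag, which is also why they depend on the number of instances and cannot be read as per-instance class probabilities.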
Evaluation metrics. We evaluate the different models in this work using standard MIL metrics for both instance- and bag-level performance on the test subset. Concretely, for instance-level validation we report accuracy (Acc) and the F1-score, per class and micro-averaged. Also, as the Gleason grades constitute a set of ordered classes, we report Cohen's quadratic kappa ($\kappa$) as a figure of merit. Regarding the bag-level predictions, we evaluate them using the area under the ROC curve (AUC). In the multi-label scenario, the AUC is obtained class-wise and then averaged (mAUC). In order to facilitate the comparison of our methods with previous literature at the bag level, we also obtained the AUC for binary cancer vs. non-cancer detection by combining the class-wise predictions and targets via max aggregation. For each experiment, the metrics shown are the mean of three consecutive repetitions of the model training (with their respective standard deviations), to account for the variability of the stochastic factors in the process.
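Cohen's quadratic kappa for ordered grades can be computed directly. The following is a self-contained sketch (libraries such as scikit-learn provide an equivalent metric); it assumes at least two distinct classes are observed:

```python
def quadratic_kappa(y_true, y_pred, K):
    """Cohen's quadratic-weighted kappa for K ordered classes.
    Disagreements are weighted by (i - j)^2 / (K - 1)^2, so confusing
    distant grades costs more than confusing adjacent ones."""
    n = len(y_true)
    O = [[0.0] * K for _ in range(K)]      # observed confusion matrix
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    rows = [sum(O[i]) for i in range(K)]
    cols = [sum(O[i][j] for i in range(K)) for j in range(K)]
    num = den = 0.0
    for i in range(K):
        for j in range(K):
            w = (i - j) ** 2 / (K - 1) ** 2
            num += w * O[i][j]
            den += w * rows[i] * cols[j] / n   # expected counts by chance
    return 1.0 - num / den
```

Perfect agreement gives $\kappa = 1$, chance-level agreement gives 0, and systematic disagreement gives negative values.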

Results
Comparison to the literature. The quantitative results obtained by the proposed model and baselines on the test cohort are presented at the instance level in Table 3, and at the bag level in Table 5 and Fig. 3. Also, we include results reported in a relevant body of literature for both tasks, using different datasets and experimental settings, at the instance level in Table 4 and at the bag level in Table 6. Instance-level results. The proposed constrained formulation using the positives expansion (PE) term to enhance positive instance prediction outperforms the max-aggregation baseline by ∼5% in instance-level classification accuracy. Adding the Student stage, the model reaches an accuracy of 0.610, which outperforms on SICAP-MIL the Teacher-Student strategy using only max aggregation from [2]. The observed improvement could be caused by the larger number of instances classified using the inequality constraint, which prevents the model from over-fitting to only very discriminative instances. Note that, although the results reported in [2] in prior literature are still better, the training dataset required to accomplish them is much larger: around 10,000 WSIs. Once we introduce the proportion information, in terms of primary and secondary classes in the bag, via the proportion inequality constraint (PC), results reach an accuracy of 0.705 and an average F1-score of 0.655. It is noteworthy that these results are similar to the ones obtained in prior literature under full supervision on similarly sized datasets [15,19-21]. Under our proposed formulation, the model is capable of grading cancerous patches at the same performance as using pixel-level annotated datasets, by providing only WSI-level information about the most abundant grade.
Bag-level results. Regarding the MIL bag-level results obtained, our PE formulation improved by around ∼0.7% over the baseline instance-based

Table 6
Quantitative comparison to prior literature at the bag level. Results reported on different datasets, patch sizes and resolutions. The metric presented is the Area Under the ROC curve (AUC). Results derived from the proposed methods in gray. WSI: whole slide image [22,23].

Method | Training WSIs | Cancer Detection | Multilabel
maximum aggregation. This modest improvement may be due to the fact that, because of the maximum-based inference, it is only necessary to locate one positive sample to get the bag-level prediction right. These observations are in line with previous literature, which highlights that the best classifier at the bag level need not be the best classifier at the instance level [24]. Once we incorporate the proportion information during training, the proposed model increases the multi-label mAUC by ∼3.3% over the baseline, reaching an mAUC of 0.899 in the multi-label scenario and 0.979 in the binary prediction (see Table 5). Note that this result almost reaches the ones reported in previous literature (see Table 6), which use thousands of WSIs during training. However, it is worth noting the limitations of this indirect comparison. The methods used in previous works may have different levels of supervision, and the datasets used are larger. Next, we perform a direct comparison of the weakly supervised methods on the database used in this work, SICAP-MIL (see Table 5). Specifically, we pay attention to the performance of embedding-based methods at the bag level. The results obtained using mean and max aggregation are similar to the baseline instance-based max approach. However, in the multi-label scenario, these methods perform worse. Moreover, since they cannot provide instance-level labels, they cannot take advantage of the information referring to the proportions during training. It is notable that deep-learning-based aggregation modules such as AttentionMIL or RNNs do not perform properly in this training setting. This could be due to the complexity of having multiple classes in some bags, the over-fitting tendency of neural networks, and the incapacity of AttentionMIL to obtain class-specific attention weights. Finally, we would like to point out that a significant body of previous work validates multi-class methods at the bag level on the basis of Gleason scores. However, this
score is beyond the scope of MIL. Its derivation involves decision-making by the clinical expert according to the severity of the grades in the tissue, which does not fit a proper MIL formulation (see Eq. (1)), based on the presence of each class in the bags of instances.
Ablation studies. In the following, we provide comprehensive ablation experiments to validate several elements of our model, and to motivate the choice of the values employed in our formulation, as well as our experimental setting. First, we optimized the proposed formulation using only the inequality constraint term in Eq. (6). Using the training setting previously described, we validated different values of $\lambda_{PE} = \{0.01, 0.1, 1\}$ and slopes of the log-barrier inequality $t_{PE} = \{1, 5, 10, 15\}$. Using the mAUC on the validation subset as an early stopping criterion, we obtained the bag-level mAUC from the validation subset and the instance-level accuracy from the test cohort. Results are presented in Fig. 4. These show that the inclusion of the PE term improves the performance at both bag and instance level under most of the settings. Thus, we selected $t_{PE} = 15$ and $\lambda_{PE} = 0.1$, which led to the best results at the bag level in the validation cohort.
Then, using the best configuration reached for the PE term, we optimized the proportion constraint (PC) configuration in Eq. (8). During empirical experimentation, we observed that the instance-level model performance on the test subset did not always correlate with the bag-level performance on the validation or test cohort when applying early stopping based on the mAUC metric. As the proposed PC loss term provides information about the correct prediction of proportions, we evaluated this term as an early stopping criterion. Thus, we also kept track of the epoch average of the constraint satisfaction. Among the full range of hyperparameter values, the ones that showed the best stability during training were $\lambda_{PC} = \{0.1, 1\}$ and $t_{PC} = \{1, 5, 10\}$. We show the results obtained at the bag level and the instance-level accuracy on the test cohort, as well as the proportion constraint satisfaction on the train subset for both early stopping criteria, in Fig. 5.
The figures of merit indicate that the criterion based on constraint satisfaction (dashed lines) consistently outperforms the validation mAUC criterion (solid lines) at both instance and bag level for all settings. This could be explained by the possible bias introduced by the validation subset due to class imbalance. Likewise, maximizing the difference in proportion between the majority and minority classes can help to better distinguish between them. The results obtained are in line with these observations, since lower values of $t_{PC}$ seem to obtain better results. Due to the formulation of the barrier extension (Eq. (4)), low values of $t$ contribute not only to fulfilling the constraint, but also to maximizing it, by using a slope proportional to $1/t$. Therefore, we selected the setting that gives the largest proportion difference between the primary and secondary grade on the train cohort: $t_{PC} = 5$ and $\lambda_{PC} = 1$.
Qualitative evaluation. Finally, we want to get a more intuitive view of how the different terms of the proposed methodology influence the extraction of discriminative features. For that purpose, we depict the feature representation of the embedding space produced by the encoder networks on the instance-level labeled test cohort using t-SNE [25] in Fig. 6. Concretely, we obtained the two-dimensional t-SNE embedding using a perplexity value of 40 and 300 iterations. The t-SNE representation is obtained for the instance-max setting (Fig. 6(a)), instance-max with the PE term (Fig. 6(b)), and instance-max with the PE and PC terms (Fig. 6(c)), after Student model training.
The features obtained using the basic max aggregation are largely overlapping for the cancerous classes. Although the PE term slightly improves this condition, only once the PC term is included is it possible to distinguish class-wise clusters between Gleason grades 3 and 4. These grades tend to coincide in WSIs, with Gleason score 7 (whole slide images that include both tumor growth patterns of grades 3 and 4) being the most common in the database used (see Table 1 and Fig. 2). This fact produces noise during training, as many bags are positive for both classes simultaneously, making it difficult to distinguish between the two types of instances. However, when we introduce the relative proportion information of both classes during training, the network is able to promote a distinction between them.
Also, we introduce in Fig. 7 visualizations of the obtained instance-level classifications, compared to the pathologists' annotations and the baselines. Instance-level predictions are performed on the test subset biopsies using an overlap of 75% between instances, to gain spatial resolution. Then, the instance-level scores are assigned to each pixel of the patch, and they are averaged among the overlapping patches. From the selected representative examples, it is observed that once the different proportion constraints are introduced, the model is able to better differentiate between the different Gleason grades (first and second rows), and locates more cancerous regions (third row).
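The overlapped-score averaging used for these visualizations can be sketched as follows (a simplified single-class illustration with our own names; the real maps are computed per Gleason grade):

```python
import numpy as np

def overlap_average(shape, patch_scores, tile=512):
    """Build a pixel-level score map from overlapping patch predictions.
    patch_scores: dict mapping a patch's top-left (x, y) to its scalar
    class score; each pixel receives the mean of all patches covering it."""
    acc = np.zeros(shape, dtype=float)   # accumulated scores
    cnt = np.zeros(shape, dtype=float)   # number of covering patches
    for (x, y), s in patch_scores.items():
        acc[y:y + tile, x:x + tile] += s
        cnt[y:y + tile, x:x + tile] += 1
    cnt[cnt == 0] = 1                    # leave uncovered pixels at zero
    return acc / cnt
```

With a 75% overlap (stride of one quarter of the tile size), interior pixels are covered by up to 16 patches, which smooths the resulting heatmap.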

Conclusions
In this work, we have presented a novel constrained multi-label instance-based MIL formulation that encourages the network to focus on many positive instances, and allows imposing restrictions on the relative proportions of class sizes within the bag. In particular, we combine a standard instance-based max aggregation with additional inequality constraint terms via a flexible log-barrier extension. We validate the proposed formulation on a new publicly available dataset of prostate histology cancer WSIs, SICAP-MIL. In the experimental stage, our method shows that, by forcing the network to classify more positive instances, the results improve by ∼5% in instance-level classification accuracy. By simply incorporating relative proportion information about the primary grade in the WSI, which is usually easily accessible from medical records, our method reports improvements of ∼9% in accuracy at the instance level, and ∼3.3% in mAUC at the bag level. In addition, the target relative proportion difference between the primary and secondary classes in the bag has proven to be a good criterion when optimizing the model, obtaining more generalizable results than using the mAUC at the bag level. The obtained results are comparable to prior works using similarly sized datasets under the supervised paradigm, which require tedious instance-level annotations.
For the PE setting, we fixed $\lambda_{PE} = 0.1$ and $t_{PE} = 15$. Training was carried out during 100 epochs using a batch size of 1 bag and the SGD optimizer with a learning rate $\eta = 1 \cdot 10^{-2}$. After 50 epochs, $\eta$ is decreased by a factor of 10. During training, the bag-level mAUC is monitored in the validation set, and early stopping is applied if this figure of merit does not improve during 20 epochs. Then, the PC formulation is trained keeping the PE hyperparameters constant, and empirically setting $\lambda_{PC} = 1$ and $t_{PC} = 5$. The training is carried out using the same conditions as the PE setting. Nevertheless, instead of using the mAUC from the validation subset as the early stopping criterion, we use the average proportion constraint satisfaction, $p_{k'} - p_{k''}$ in Eq. (7), from the training set to determine the best model. The hyperparameters and early stopping criterion used are further justified by means of ablation experiments. The code and trained models are publicly available at https://github.com/jusiro/mil_histology.
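The early-stopping bookkeeping described above can be sketched generically. This is our own helper, not the paper's code; the monitored value would be either the validation mAUC or the constraint-satisfaction margin:

```python
def select_epoch(history, patience=20):
    """Return the index of the best epoch under patience-based early
    stopping: training stops once the monitored figure of merit has
    not improved for `patience` consecutive epochs."""
    best, best_epoch, wait = None, 0, 0
    for epoch, value in enumerate(history):
        if best is None or value > best:
            best, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch
```

Applied to the constraint-satisfaction criterion, `history` would hold the per-epoch average of $p_{k'} - p_{k''}$ over the training bags.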

Fig. 4. Ablation studies on the positives expansion (PE) MIL formulation. Hyperparameter studies for $\lambda_{PE}$ and $t_{PE}$ are performed for bag-level mAUC on the validation set (a), and instance-level accuracy (b).

Fig. 5. Ablation studies on the proportion constraint (PC) MIL formulation. Hyperparameter studies for $\lambda_{PC}$ and $t_{PC}$ are performed for bag-level mAUC on the test set (a) and instance-level accuracy on the test set (b). Also, two early stopping criteria are validated: mAUC on the validation set (solid lines) and proportion constraint satisfaction (dashed lines), whose values are illustrated in (c).

Fig. 6. Visualization of the embedding space produced by the baselines and the proposed method on the labeled instances from the SICAP-MIL test cohort. (a) instance-max; (b) instance-max w/ PE; (c) instance-max w/ PE and PC. Red: non-cancerous; light blue: Gleason grade 3; dark blue: Gleason grade 4; purple: Gleason grade 5.

Fig. 7. Visual examples of the proposed model performance on instance-level prostate cancer grading. In particular, the pathologists' annotations are depicted together with the instance-based MIL baseline using max aggregation, and the results when we introduce the proportion priors. In green: Gleason grade 3; blue: Gleason grade 4; red: Gleason grade 5.

• We benchmark the proposed model against a relevant body of literature on SICAP-MIL, a new publicly available dataset containing 350 prostate WSIs with global labels, as well as instance-level labels to test weakly supervised methods on tumor localization.
• Comprehensive experiments demonstrate the superior performance of our model. By simply incorporating relative proportion information during training (easily accessible from medical records in many cancer types), we found improvements of nearly ∼3% in mean AUC for bag-level classification and ∼13% for instance-level cancer grading accuracy compared to prior MIL methods.

Table 1
SICAP-MIL dataset. Whole slide image partitions and Gleason score (GS) distribution. NC: non-cancerous.
Fig. 2. SICAP-MIL dataset description. The confusion matrix shows the distribution of global labels in terms of primary and secondary Gleason grades per whole slide image. GG: Gleason grade. NC: non-cancerous.

Table 2
Datasets with patch-level Gleason grade annotations used for testing. Distribution of the patches among non-cancerous (NC) and the different Gleason grades (GG).

Table 3
Quantitative comparison to prior literature at the instance level on the SICAP-MIL dataset. Results derived from the proposed methods in gray. Best results in bold. NC: non-cancerous; GG: Gleason grade; $\kappa$: Cohen's quadratic kappa.

Table 4
Quantitative comparison to prior literature at the instance level. Results derived from the proposed methods in gray. TMAs: tissue microarrays; WSIs: whole slide images; NC: non-cancerous; GG: Gleason grade; $\kappa$: Cohen's quadratic kappa.
Results reported on different patch sizes and resolutions, on private datasets. The gated attention mechanism was implemented with an intermediate layer of $L = 128$ neurons. Campanella et al. [1] proposed an RNN-based aggregation over the top-$k$ positive instances of each bag to produce bag-level classifications. We increased $k$ to 10 to support the multi-label scenario, and an RNN with a hidden state of 128 neurons was trained. All methods were trained under the same setup (i.e. backbone, learning rate, scheduler, batch size, etc.) as our baseline.

Table 5
Quantitative comparison to prior literature at the bag level on the SICAP-MIL dataset. The metric presented is the Area Under the ROC curve (AUC). Results derived from the proposed methods in gray. Best results in bold.
Fig. 3. Overall receiver operating characteristic (ROC) curves for the multi-label bag-level prediction of the proposed methods and baselines on the SICAP-MIL dataset.