Uncertainty-Aware Multiple-Instance Learning for Reliable Classification: Application to Optical Coherence Tomography

Deep learning classification models for medical image analysis often perform well on data from scanners that were used during training. However, when these models are applied to data from different vendors, their performance tends to drop substantially. Artifacts that only occur within scans from specific scanners are major causes of this poor generalizability. We aimed to improve the reliability of deep learning classification models by proposing Uncertainty-Based Instance eXclusion (UBIX). This technique, based on multiple-instance learning, reduces the effect of corrupted instances on the bag-classification by seamlessly integrating out-of-distribution (OOD) instance detection during inference. Although UBIX is generally applicable to different medical images and diverse classification tasks, we focused on staging of age-related macular degeneration in optical coherence tomography. After being trained using images from one vendor, UBIX showed a reliable behavior, with a slight decrease in performance (a decrease of the quadratic weighted kappa ($\kappa_w$) from 0.861 to 0.708), when applied to images from different vendors containing artifacts; while a state-of-the-art 3D neural network suffered from a significant detriment of performance ($\kappa_w$ from 0.852 to 0.084) on the same test set. We showed that instances with unseen artifacts can be identified with OOD detection and their contribution to the bag-level predictions can be reduced, improving reliability without the need for retraining on new data. This potentially increases the applicability of artificial intelligence models to data from other scanners than the ones for which they were developed.


Introduction
Deep learning models for medical image analysis applications are often trained on data that is acquired with one or a selected number of scanner types and/or acquisition protocols.When applying these trained models on data from Volume-level probabilities (darker is higher) Figure 1: Overview of the method.Each MIL instance is fed into the same classifier.During inference time, a UBIX function converts pre-UBIX instance-level logits based on their respective uncertainties to post-UBIX logits, which are reduced for OOD instances.The instance-level post-UBIX logits are then converted to bag-level outputs using MIL pooling.During training, the pre-UBIX logits are fed directly into the MIL pooling function.different scanners or protocols, the performance tends to plummet ( [1,2]).This negatively affects the reliability of these systems, which is a main aspect of trustworthy AI ( [3,4]), and its wide integration and adoption in clinical practice.In general, convolutional neural networks (CNNs) are known to fail when they are applied under dataset shift or to out-of-distribution (OOD) datasets; and approaches to address this effect are being investigated ( [5]).This OOD nature of the data occasionally only stems from local areas in images, such as local artifacts.These local artifacts occur frequently in data from specific vendors or particular scanning protocols, and can be found in multiple medical imaging fields, such as contrast-enhanced mammography data ( [6]) and optical coherence tomography ( [7]).In images with these types of artifacts, there generally are sufficient parts in a sample which are in-distribution (ID) to form a correct prediction if the model would in some way only focus on those parts of the data and neglect the OOD areas.
To achieve this increased robustness to local OOD areas in images, we propose Uncertainty-based Instance eXclusion (UBIX).This approach builds upon multiple-instance learning (MIL), a form of weakly supervised learning popular in medical image analysis ( [8], [9]).In MIL, a labeled bag (usually the whole input image) consists of multiple unlabeled instances (usually image patches, regions or slices).During deep MIL, instances are considered individually by a neural network and the instance-level outputs are subsequently combined, equally contributing, to obtain a baglevel prediction using a MIL pooling function ( [10]).Instead of assuming equal contribution, the UBIX approach assumes that some of the instances might be corrupted due to local artifacts, identifies these instances on-the-fly using uncertainty estimation, and reduces or ignores its contribution to the bag-level prediction using the so-called UBIX function before the MIL pooling function (see Fig. 1).To the best of our knowledge, this is the first method that uses OOD detection in such a manner to increase reliability.
Although our method is applicable to any instance definition (such as 2D or 3D patches) or bag definition, we focus on MIL problems in which slices are instances and full 3D volumes are bags.Specifically, we focus on classifying age-related macular degeneration (AMD) in optical coherence tomography (OCT), as there is a plurality of manufacturers and scanner versions in the field of OCT.[11] list fourteen companies that produce OCT scanners for ophthalmic applications and the type of imaging artifacts that occur can differ substantially across scanners ( [7]).These artifacts include slices (B-scans) that are fully black due to blinking, vertically flipped B-scans, shadows and noise.For example, blinking artifacts are much less common in certain scanners, specifically ones with higher speed and eye tracking software ( [7]).
We evaluate the generalizability of our proposed models by training on data acquired with a scanner from one vendor, while evaluating with data from scanners of other vendors.We show that UBIX increases this generalizability using an ablation study.Moreover, we systematically analyze the ability of UBIX to detect OOD instances by gradually introducing artificial image artifacts that occur naturally as well.The trained algorithm is publicly available for inference on the online platform of Grand Challenge1 .

Multiple-instance learning in medical imaging
One of the most common medical application in which MIL is applied is histopathology ( [12], [13], [14], [15]), mainly because it is very labor-intensive and time-consuming to manually annotate entire whole slide images on instance-level.[9] used an attention-based MIL pooling layer and evaluated it on an MNIST-based dataset and histopathology datasets.Other medical modalities to which MIL with deep learning has been applied include ultrasound ( [16,17]), computed tomography ( [18,19]) and magnetic resonance imaging ( [20,21]).Unlike our proposed method, these approaches suffer of low reliability when transferred to other distributions.

Out-of-distribution detection
OOD detection is the identification of samples that originate from a different distribution than the training distribution.Such samples generally have high model predictive uncertainties, given a good uncertainty estimation method.[22] proposed a simple baseline for OOD detection using the maximum class probability as confidence scores.Another early work was Monte Carlo dropout (MC-DO), in which they leveraged dropout to estimate uncertainty [23].[5] compared a number of methods for OOD detection and uncertainty estimation including MC-DO.They found that deep ensembling ( [24]) was one of the top-performing methods for OOD detection.Since then, other popular methods for uncertainty estimation and OOD detection have been published ([25, 26, 27]).
Uncertainty estimation has been investigated for medical image analysis as well, such as [28], who used deep ensembles to calibrate probabilities in segmentation maps.[29] detected incorrect orientation or anatomy in X-rays using an OOD detection metric called FRODO, defined as the Mahalanobis distance of test samples to samples in the train set.Furthermore, [30,31] used multi-head CNNs, an approach similar to deep ensembles, to detect images with lymphoma in histopathology as OOD samples.
In general, uncertainty estimation in medical images are used as an additional output to assess the behaviour of the developed models or to identify abnormalities as OOD samples.In contrast, our proposed approach takes into account OOD detection during inference to increase classification robustness against data shift.

Robustness against data shift in OCT
Related works have successfully applied machine learning methods for AMD classification from OCT, but did not specifically focus on robustness to OOD data ( [32,33,34,35,36,37]).
The following works studied robustness against data shift in OCT.[2] used an OCT segmentation network of which the output was fed into a classification network, which in turn outputted a referral suggestion, diagnosis probabilities for multiple retinal disease features, such as choroidal neovascularization (CNV) and geographic atropy (GA), and volume estimations of drusen and epiretinal membranes.The error rate on their internal test set with OCTs from the same scanner, as the development set, i.e.Topcon, was 5.5%, but the error rate increased to 46.6% when transferred to an external set with OCTs from a different scanner, i.e.Heidelberg Spectralis.When retraining their segmentation network with data from this scanner, the error rate improved to 3.4%.[38] and [39] used a CycleGAN to transform OCT scans acquired on a device that was not used during training to have a similar appearance as the training data.For retinal fluid ( [38,39]) and layer ( [39]) segmentation, they observed a generalizability improvement when applying this domain adaptation technique, compared to traditional transformation strategies.
The main downside of these methods is their requirement for -albeit annotated or not -data from the new setting.Acquiring and annotating this new data, as well as any potential retraining, is a time-consuming and expensive process.Moreover, if these models are unknowingly applied in settings that are highly different from the development setting, models can fail silently, potentially causing misdiagnoses.We propose a method that reduces the performance drop when a model is transferred to a setting unlike its development setting, without the requirement for acquiring or labeling data originating from this new setting.

Methods
In this section, we first introduce UBIX, a method that reduces the effect of corrupted instances on the bag classification by seamlessly integrating an OOD instance detection technique during inference (Section 3.1).The pipeline of the method is shown in Fig. 1.Subsequently, we introduce uncertainty estimation techniques used to identify OOD instances, including a novel ordinal uncertainty estimation technique, tailored to the classification of ordered classes (Section 3.2).

UBIX during training time
Let X = {x 1 , . . ., x I } be a bag consisted of a set of instances x i , and Y ∈ {1, . . ., C} the bag label.We assume the number of instances in the bag, I, can vary between bags and each instance x i in the bag also has an associated label y i ∈ {1, . . ., C}, which is not available.In this paper, we work with a staging problem, in which the classes are ordered, so we make the following assumption: We define an instance-level classifier which is an ensemble of neural networks, f θm (•) with parameters θ m for each network in the ensemble m ∈ {1, . . ., M }.Each network transforms instances x i to logits h i,m,c for each class 1, . . ., C such that h i,m,c ∈ R. Instance-level probabilities can be obtained with the softmax function: A MIL-pooling converts instance-level logits to bag-level logits.Following the assumption in Eq. 1, we define the MIL-pooling function as: where z m,c is the bag-level logit for model m in the ensemble and class c.The bag-level probability for class c can be calculated using:

UBIX during inference time
Each instance x i has an associated uncertainty estimate U i .In Section 3.2, we present various approaches of calculating this instance-level uncertainty.
In UBIX, during inference, instance-level logits are modified using the UBIX function g to consider the associated uncertainty: where δ is a hyperparameter that determines the smoothness of g and χ c is: where X val is the validation set.
Fig. 2 shows a plot of this function.γ is the value for h i,m,c where the slope of g is the steepest, defined as: where γ ∈ [0, 1] is a hyperparameter, U min is: and: During inference, the MIL-pooling function takes as input g(h i,m,c ) instead of h i,m,c : A special case of UBIX, where essentially the instances with an uncertainty exceeding a threshold τ are excluded from the bag, is the one for which δ = ∞.The bag-level prediction is then calculated with this pruned bag.τ is a hyperparameter which can be optimized using the validation set.For this UBIX variant, the UBIX function is a step function.

Instance-level uncertainty
The uncertainty U i of the instance x i in a bag X is estimated using deep ensembling, as introduced by [24].We do not employ adversarial training, in contrast to what was done by [24].Uncertainty estimates can be obtained from deep ensembles by combining the individual model outputs.This can be done in various ways, resulting in different uncertainty measures.Commonly used uncertainty measures are entropy ( [5,40]), variance ( [23,40]) and maximum class probability ( [22,5,40]).When using these measures for ordinal classification problems, such as problems with a staging scale, the uncertainty between similar stages is quantified equally as the uncertainty between classes stage that are very far apart.To solve this, we introduce two ordinal variants for variance and entropy that take the order of classes into account.Specifically, these measures will be higher if the uncertainty is larger between two classes that are far apart than two classes that are closer on the ordinal scale.For example, if the probabilities of class 1 and 2 are both 50%, and so the other three classes are 0%, the uncertainty will be lower than if the probabilities of class 1 and 5 are both 50%.Table 1 shows how these uncertainty measure are calculated for deep ensembles.
Mean class variance Ordinal variance , where Ordinal entropy Three different data sets from three different vendors were used to develop and evaluate the proposed solution: a dataset with Heidelberg OCTs served as a training set, referred to as H train , a validation set, referred to as H val , and internal test set, referred to as H test (Section 4.1.1);and two external test sets were used to evaluate the generalizability of our models, one with Topcon scans, referred to as T test , (Section 4.1.2) and one with Bioptigen scans, referred to as B test (Section 4.1.3).Manual grading of the scans was performed by the Cologne Image Reading Center and Laboratory (CIRCL).They categorized the OCTs using the criteria described in Table 2. Samples with grade 6 and 7 were excluded from this study.The number of OCTs for each of the remaining five stages is shown in Table 3.The OCTs were acquired with a Spectralis HRA+OCT (Heidelberg Engineering, Heidelberg, Germany) scanner.We resampled all B-scans to the same pixel spacing of 13.9 µm × 3.9 µm.The number of B-scans in each OCT scan was left unchanged, which varied from 14 to 73.

Topcon dataset (T test )
One of the external test sets was derived from the Rotterdam Study ( [43]).This is a prospective cohort study in the city of Rotterdam, the Netherlands, that started in 1990 to investigate age-related diseases.There were in total 1184 OCT scans available from this dataset, originating from 713 patients.All OCTs were graded using the Wisconsin Age-related muclopathy grading system (WARMGS) ( [44]) and manually harmonized to the CIRCL grading system.The number of OCTs for the resulting four classes is shown in Table 3.The OCTs from this dataset were taken with an OCT scanner from Topcon Corp., Tokyo, Japan.Each OCT volume contained 128 B-scans.Similarly to the Heidelberg set, all B-scans were resampled to have a pixel spacing of 13.9 µm × 3.9 µm.

Bioptigen dataset (B test )
The other external test set was described by [45], containing normal patients and patients with intermediate AMD.For each of these subjects one OCT volume, acquired with an SD-OCT scanner from Bioptigen, Inc (Research Triangle Parc, NC), was available.The AREDS2 system [46] was used for grading and was harmonized to CIRCL grading system.The number of OCTs for these two classes is given in Table 3.All OCT volumes contained 100 B-scans and, all B-scans were again resampled to have a pixel spacing of 13.9 µm × 3.9 µm.
5 Experimental design

Vendor generalizability and interpretability
To assess vendor generalizability of UBIX, we first calculated the performance on the internal test set, H test , which is from the same distribution as the one used for training, H train .Subsequently, we evaluated the performance when transferring to the two external datasets T test and B test .In the remainder of this paper, when we mention only UBIX, we refer to UBIX with the optimal δ that resulted from the grid search described later in Section 5.5, unless stated otherwise.We evaluated the performance on these datasets for UBIX with δ = ∞ and UBIX.Additionally, we compared the performance of the proposed model with three different approaches, namely a 3D CNN approach, a traditional MIL approach (without UBIX) and an ensemble of multiple MIL approaches.The 3D CNN was a ResNet-18 ( [47]) with 3D convolutions and the instance-level classifiers in the MIL approaches were ResNet-18's with 2D convolutions.
To better show the effect of the proposed methodologies on scans with vendor-specific artifacts, we also separately evaluated the performance of the five aforementioned UBIX variants and ablations on a subset of OCT volumes in T test with blinking artifacts, referred to as T blink .These volumes generally have multiple B-scans in which the retina is not visible.The interpretability of UBIX is illustrated qualitatively by showing the instance-level predictions and uncertainties for several OCT scans.

Effect of artificial artifacts
To demonstrate the effect of UBIX more clearly, we performed experiments where we artificially corrupted the dataset with artifacts that also occur naturally in OCT scans.The different artifact types were blinking artifacts, vertically flipped B-scans, shadows and noise.Fig. 3 shows a number of examples.We gradually introduced more OCT volumes with artificial artifacts and compared the performance for UBIX, UBIX with δ = ∞ and MIL.
When one of these artifacts was applied to an OCT volume, a portion of the B-scans were affected, as happens in clinical scenarios.Artificial artifacts were then added to either one or two groups of adjacent B-scans.Both scenarios had an equal probability.The sizes of these groups had sizes of between 2% and 15% of B-scans, which we experimentally found to be representative of real artifacts.
Vertically flipped B-scans are caused by a Fourier-domain detection artifact, as described by [48].Shadows and noise are usually caused by media opacities, such as corneal scarring and cataract.The artificial artifacts were implemented as follows: • B-scans with artificial blinking artifacts were generated by taking an image in which all pixel values are 0 to which we applied additive random Gaussian noise for which the mean was the median pixel value in the full OCT scan and the standard deviation was equal to the standard deviation of the OCT scan.
• The vertically flipped B-scans were generated by flipping the B-scans along the horizontal axis.
• To generate the shadow artifact for a particular B-scan, we adapted each A-scan (column in an OCT B-scan) a 1 , . . ., a A separately, where A is the number of A-scans in the B-scan.All A-scans a i were transformed using the shadow function S(a i ): where s(i) is sampled from a normal probability density function: where µ is randomly selected between 0 and A, and σ is randomly defined between A/4 and 3A/4.µ and σ are kept the same within one OCT volume.
• The noise artifact is Gaussian noise added to the original image with a mean of 0 and a standard deviation of 4 times the standard deviation within the original OCT volume.

Effect of uncertainty measures
We assessed the performances of different uncertainty measures on T test .The uncertainty measures that we evaluated were the ones presented in Table 1.Moreover, we separately evaluated the performance of the different uncertainty measures on T blink .

Metrics
For all models, to evaluate the classification performance, we calculated the area under the receiver operating characteristic curve (AUC), where intermediate and advanced AMD stages belonged to the positive class and the remaining stages to the negative class.Additionally, we computed Cohen's kappa score.For the datasets with more than two classes, the quadratic weighted kappa score (κ w ) was calculated to consider the class order.Otherwise, unweighted kappa metric was used (κ).
To quantify artificial artifact detection performance for different uncertainty measures, we used the AUC as well, where the score was the uncertainty measure and the labels were the dichotomous variable of whether an instance had an artificial artifact or not.Furthermore, to estimate how well the uncertainty values were separated, we evaluated the separability of the two groups, with and without artificial artifacts, based on the uncertainty score.For this, the Xie-Beni index (XB) is calculated, defined as the ratio between cluster separation (i.e. the minimum squared distance between cluster centers) and cluster compactness (i.e. the mean squared distance between each data point and its cluster center ( [49]).The lower XB, the better the data is clustered.
The network weights were optimized with the Adam optimizer ( [50]) and a learning rate of 10 -4 .All images were normalized between 0 and 1.As a means of regularization, we employed online data augmentation.With a 15% probability, random affine transformations were applied of ±20 • rotation within the B-scan plane, ±10% shearing within the B-scan plane, ±10% zooming within the B-scan plane, ±20 voxels translation in the horizontal and vertical direction within the B-scan plane, ±2 voxels translation in the B-scan direction.B-scans were also horizontally flipped with a 15% probability.With a 30% probability, random additive Gaussian noise with a mean of 0 and a standard deviation of 0.1 was applied.Also with a 30% probability, we applied brightness modifications using the power law, varying the power between 0.75 and 3. To account for class imbalance, images were sampled during training based on their class, such that each class was sampled with an equal probability.Because of GPU memory restrictions and given that the input images were large, the batch size was 1.We used early stopping, based on κ w and with a patience of 10 000 batches.The deep ensembles contained 5 models, as this was shown to be sufficient for uncertainty estimation ( [5]).The network weights were optimized with the Adam optimizer ( [50]) and a learning rate of 10 -4 .All images were normalized between 0 and 1.As a means of regularization, we employed online data augmentation.With a 15% probability, random affine transformations were applied of ±20 • rotation within the B-scan 0.915 ± 0.020 0.924 ± 0.020 * ‡ 0.922 ± 0.020 * plane, ±10% shearing within the B-scan plane, ±10% zooming within the B-scan plane, ±20 voxels translation in the horizontal and vertical direction within the B-scan plane, ±2 voxels translation in the B-scan direction.B-scans were also horizontally flipped with a 15% probability.With a 30% probability, random additive Gaussian noise with a mean of 0 and a standard deviation of 0.1 was applied.Also with a 30% probability, we applied brightness modifications using the power law, varying the power between 0.75 and 3. To account for class imbalance, images were sampled during training based on their class, such that each class was sampled with an equal probability.Because of GPU memory restrictions and given that the input images were large, the batch size was 1.We used early stopping, based on κ w and with a patience of 10 000 batches.The deep ensembles contained 5 models, as this was shown to be sufficient for uncertainty estimation ( [5]).The network weights were optimized with the Adam optimizer ( [50]) and a learning rate of 10 -4 .All images were normalized between 0 and 1.As a means of regularization, we employed online data augmentation.With a 15% probability, random affine transformations were applied of ±20 • rotation within the Bscan plane, ±10% shearing within the B-scan plane, ±10% zooming within the B-scan plane, ±20 voxels translation in the horizontal and vertical direction within the B-scan plane, ±2 voxels translation in the B-scan direction.Bscans were also horizontally flipped with a 15% probability.With a 30% probability, random additive Gaussian noise with a mean of 0 and a standard deviation of 0.1 was applied.Also with a 30% probability, we applied brightness modifications using the power law, varying the power between 0.75 and 3. To account for class imbalance, images were sampled during training based on their class, such that each class was sampled with an equal probability.Because of GPU memory restrictions and given that the input images were large, the batch size was 1.We used early stopping, based on κ w and with a patience of 10 000 batches.The deep ensembles contained 5 models, as this was shown to be sufficient for uncertainty estimation ( [5]).

Results
Table 4 shows the performance of the two UBIX variants, compared to three ablated models.Fig. 4 shows the performance differences in percentages for these methods.On the internal dataset, most differences between models are smaller than those found for the external datasets.The performance on the external dataset drops compared to the internal dataset for all methods.However, in terms of κ w this drop is always smaller for the UBIX models than for the other models.UBIX especially outperforms the ablated models on cases with artifacts; on T blink , κ w is 0.708 for UBIX, while it was 0.479 for only MIL, 0.346 for MIL (no ensemble), and 0.084 for 3D.Fig. 6 shows the effect of adding artificial artifacts to the dataset when using UBIX, compared to MIL, by evaluating their performance when varying the percentage of OCT volumes that are affected by artificial artifacts in the T test dataset.
Figure 4: Performance differences between datasets of the various methods.In the left plot, κ w is shown for H test , T test and T blink , and κ is shown for B test .In the right plot, AUC is shown.The absolute performances are indicated by the width of the bar plots, while the performance differences of the three external datasets compared to the internal test set are indicated with text.
Table 5 shows the performances on T test of UBIX when using different uncertainty measures, where the hyperparameters were optimized separately on the internal validation set for each uncertainty measure.For T test , κ w is highest for ordinal entropy.

Discussion
We proposed a method using MIL with OOD detection to improve the generalizability of deep learning models for classification of 3D medical images.The model aims to reduce the effect of on-the-fly detected OOD instances in the final classification of the bag.By suppressing the contribution of OOD instances, our proposed UBIX function maintains performance on unseen data distributions, namely images coming from different scanners.
The robustness of the proposed approach was demonstrated by transferring UBIX models and corresponding ablated versions to external datasets from different vendors.As shown in Fig. 4, UBIX variants are less prone to significant performance drops than the other models.On all external datasets, either UBIX or UBIX with δ = ∞ showed better results than the ablated models in terms of absolute performance.The performance drop was most notable on T blink , where UBIX maintained a κ w of 0.708, while the best and worst performing ablation models (MIL and 3D, respectively) has a κ w of 0.479 and Kw of 0.084, respectively.It was expected that these performance differences were more notable on T blink , which only contained OCTs with blinking artifacts, because UBIX was designed to be robust to vendor-specific artifacts.It was noted that, depending on the inference data, it was preferable to fully exclude the outputs of uncertain instances (UBIX with δ = ∞) or to only suppressed them (UBIX).From Fig. 6, we found that UBIX with δ = ∞ showed better robustness than UBIX for OCTs with artificial artifacts.A possible reason for this could be that some of the artificial artifacts highly corrupted the information, resulting in a notably strong incorrect signal and the requirement for full exclusion of uncertain instances.UBIX with δ = ∞ seemed to be especially robust to shadow artifacts, given that the performance barely decreased when introducing more OCT volumes with artificial artifacts.For some external datasets and metrics, however, UBIX achieved better results than UBIX with δ = ∞, e.g. for κ w on T blink and AUC on T test .
The performed data augmentation might also have an effect on generalizability.Since signal-to-noise ratios differ per scanner, noise augmentation probably aided our models to generalize.Although we did resample all images to have the same pixel spacing within B-scans, the original spacings that we had could have been slightly inaccurate.Therefore, zooming could potentially also have a positive effect on generalizability.The same type of data augmentation was applied in all experiments and measuring its effect was considered out of the scope for this paper.Large variability in B-scan spacing between scanners can also cause features learned by 3D CNNs to be poorly generalizable.MIL, which processes B-scans individually and combines B-scan level outputs using a MIL pooling function to get a volume level output, improves the robustness to this variability in slice spacing with respect to 3D models.We observed that performance differences are minimal between a 3D CNN and MIL when evaluated on data from the same vendor used during training.When evaluating on data from a different vendor, a performance drop was observed for both methods, although this drop is much larger for the 3D method (60.0% drop in κ w on T test ) than for MIL (29.3% in κ w on T test ).
The extent to which UBIX improves generalizability depends highly on the quality of its underlying OOD detection.Therefore, we compared three different commonly used uncertainty measures, and we proposed two ordinal variants.On T test , entropy and its ordinal variant had the highest performance in terms of κ w and AUC, respectively.The ordinal variants seemed to distinguish the B-scans with and without artificial artifacts the best.This can be seen in the density plots of Fig. 7, and this is also reflected in the AUC and XB values, which were best for the two ordinal variants.Hence, the ordinal variants led to higher performances on artificial artifacts, although in T test we find mixed observations (Table 5).A possible reason for this could be that fewer evaluation data with real artifacts were available, resulting in a less accurate performance measurement than when using artificial artifacts.
One of the advantages of using MIL as base of the UBIX model is that instance-level annotations are not required for training, while the model is able to produce a classification output at this level (B-scan level in our case) as well as calculating instance-level uncertainty.This introduces model explainability, increasing the transparency of our method, and allowing surveillance of its behaviour.
Since the three datasets were acquired and annotated at different sites with varying protocols, the reference standards were set different among these datasets.To minimize the effect of this discrepancy, we merged the first two classes of the CIRCL grading when evaluating with the WARMGS system which was available for T test .Moreover, when evaluating on B test for which only the binary labels No AMD and Intermediate AMD were available, we also binarized the CIRCL systems, where the positive class started at Intermediate AMD.This harmonization approach, however, was not perfect, causing the resulting class definitions to still not be completely equal.Despite this discrepancy, we think measuring performance differences between methods is still well possible.Nevertheless, the absolute performances can be underestimated because of these differences in reference standards.
As a potential undesired side effect, it should be noted that difficult cases, which are assumed be more uncertain, could be excluded which are in fact necessary for making a correct prediction.It will depend on the setting of model deployment whether the benefits of robustness to OOD data outweigh this drawback.In screening, for example, a high specificity is generally considered more important than a high sensitivity.Instances that are falsely excluded by UBIX will often contain abnormalities, in which case especially sensitivity would suffer, but specificity will not be affected in that case.When the task is disease staging (as is the case in this work) and UBIX is implemented with an ordinal uncertainty measure, this potential undesired side-effect is unlikely to occur.Difficulty in such a case will usually lie in the uncertainty between two classes that are close together in the staging scale (such as Early AMD and Intermediate AMD), resulting in a low estimate of the ordinal uncertainty measure.As artifacts are not likely to cause uncertainties between two classes that are close on the staging scale, these artifacts will probably have a much higher ordinal uncertainty measure.To give an indication of which instances were assigned the highest uncertainties, we manually analyzed the B-scans with the highest ordinal entropies in T test in Section A of the Appendix.There we found that more than half of the B-scans in the first percentile of most uncertain ones contained one of the nine different types of artifacts that were seemingly related to image acquisition.
If relevant structures are entirely excluded because they are difficult for the model to classify (for example because of an unseen lesion type or ambiguity), this is likely to be because there is something atypical with the whole scan or setting in which the model is used and the user should be alerted.So instead of only silently excluding instances, future work could analyze a method for combining UBIX with alerting the user if there are too many OOD instances detected.In future work, UBIX could also be adapted to work with patches as instances instead of slices.If an artifact is only locally present within a slice, the entire slice would not be excluded and potentially useful information would not be ignored.
We did not compare our method to any other domain adaptation methods which often require additional supervised or unsupervised training.Such a comparison would be unfair, as our method does not require any additional training.Nevertheless, our approach could potentially be further improved with the incorporation of domain adaptation methods such as those proposed by [38] and [39].Future work could also investigate different OOD detection methods to be incorporated in UBIX, as UBIX is theoretically compatible with any OOD detection method.Performing a systematic analysis for OOD detection methods was beyond the scope of this paper, so we only applied deep ensembles in this study.Such a comparison might lead to valuable insights and performance improvements.In this work, we only evaluated our approach for AMD grading in OCT.The method is expected to be applicable in more problem settings, such as the classification of other features and retinal diseases in OCT, but potentially also in other medical image analysis applications.

Conclusion
We showed that the generalizability of classification models to unseen scenarios can be improved by UBIX, an approach that seamlessly suppresses the contribution of OOD instances to the final classification during test-time based on the uncertainty associated with these instances, in the context of MIL for AMD classification in OCT.Our proposed approach alleviates the need for retraining on new data, which is an expensive process in terms of data acquisition, model development, and human annotation time.This increases reliability by improving the applicability of artificial intelligence models for OCT classification in broader scopes than the settings in which they were developed.

Figure 3 :
Figure 3: Examples of artificial artifacts.Each row shows the middle B-scans of a random OCT volume from T test .The image in the first row is the original, unaltered B-scan.The other rows each depict a different artificial artifact applied to that B-scan.

Fig. 5
Fig. 5 shows a number of visual examples of UBIX performance.The figure also illustrates the interpretability of the model at B-scan level.

Fig. 7
Fig.7visualizes the different uncertainty measures for artificially added artifacts.The XB value and AUC are largest when using ordinal variance or ordinal entropy.
(a) UBIX correctly predicts "No AMD/Early AMD", while MIL incorrectly predicts "Intermediate AMD".UBIX suppresses the instance-level outputs at the location of the artifact around B-scan 57, causing it to be robust to that artifact, in contrast to MIL.(b) Correct GA volume-level classification.Central GA is also well visible in the en face image as a white circular object around B-scan 60, which is pointed out by the instance-level output in red.

Figure 5 :
Figure 5: Examples where UBIX corrects volume-level and instance-level predictions.The figure also illustrates instance-level interpretability.Each subfigure shows, from left to right, the uncertainty per instance, the instance-level logits of the first model in the ensemble (only showing one model for clarity), the en face image (the volume averaged over the y-axis), and two B-scans of interest.The uncertainty, logit and en face plots correspond spatially in the horizontal direction.The left and right banner in the en face view indicate the instance-level outputs for MIL and UBIX, respectively.The B-scans on the right are highlighted in the en face view.In the bottom right of the en face views, volume-level probabilities are shown.

Figure 6 :
Figure 6: Robustness to different artificial artifacts of the UBIX, compared to MIL, on T test .The top image in each column shows an example B-scan of the artificial artifact.The plots show the relation between the performance and the percentage of OCT volumes in the dataset that contain these artifacts.The shaded areas indicate 95% confidence intervals, obtained using bootstrapping with 1000 iterations.

Figure 7 :
Figure 7: Density plots of uncertainty estimates at instances with artificial blinking artifacts compared to uncertainty estimates at instances without artificial artifacts, applied to T test .The AUC for artificial artifact detection performance and the clustering metric XB (the lower, the better) are shown above the density plots.

Table 1 :
Equations for the different uncertainty measures used in this work.U i is the uncertainty for instance x i , C is the number of classes, M is the number of models in the ensemble, and p i,m,c is the probability for instance x i , in model m of the ensemble, of class c.

Table 3 :
AMD stage distribution in each dataset.The table shows the number of OCT scans.H train H val H test T test B test

Table 4 :
[51]ormance metrics for the different methods, evaluated on internal and external datasets.The models were all trained and validated on Heidelberg data (H train and H val ).The UBIX methods in this table use ordinal entropy as uncertainty measure.n is the number of OCT scans.Symbols in the last two columns indicate statistically significant differences between 3D * , MIL (ensemble) † , and MIL ‡ , calculated with non-parametric bootstrapping and 1000 iterations ([51]).We applied a Bonferroni correction to account for the number of comparisons we made.Bolded values indicate the highest value in the row.

Table 5 :
Performance metrics for different uncertainty measures used the during instance exclusion step of the UBIX approach, evaluated on T test .The models were all trained and validated on Heidelberg data (H train and H val ).n is the number of OCT scans.Bolded values indicate the highest value in the row.