Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks

Despite their state-of-the-art performance for medical image segmentation, deep convolutional neural networks (CNNs) rarely provide uncertainty estimates for their segmentation outputs, e.g., model (epistemic) and image-based (aleatoric) uncertainty. In this work, we analyze these different types of uncertainty for CNN-based 2D and 3D medical image segmentation tasks at both the pixel level and the structure level. We additionally propose a test-time augmentation-based aleatoric uncertainty to analyze the effect of different transformations of the input image on the segmentation output. Test-time augmentation has previously been used to improve segmentation accuracy, yet it has not been formulated in a consistent mathematical framework. Hence, we also propose a theoretical formulation of test-time augmentation, where a distribution over predictions is estimated by Monte Carlo simulation with prior distributions of the parameters of an image acquisition model that involves image transformations and noise. We compare and combine our proposed aleatoric uncertainty with model uncertainty. Experiments on segmentation of fetal brains and brain tumors from 2D and 3D Magnetic Resonance Images (MRI) showed that 1) the test-time augmentation-based aleatoric uncertainty provides a better uncertainty estimation than the test-time dropout-based model uncertainty alone and helps to reduce overconfident incorrect predictions, and 2) our test-time augmentation outperforms a single-prediction baseline and dropout-based multiple predictions.


2D Skin Lesion Segmentation
We further validated our proposed method on the International Skin Imaging Collaboration (ISIC) 2018 skin lesion segmentation dataset (Tschandl et al., 2018; Codella et al., 2018). Skin cancer is the most prevalent cancer in the United States, and melanoma is its most dangerous type (Siegel et al., 2017). Dermoscopy is a promising imaging technique for the diagnosis of skin cancer. Automatic assessment of dermoscopic images is attracting increasing attention due to the shortage of dermatologists per capita. Segmentation of the lesion region plays an important role in automatic measurement and diagnosis of skin cancer (Yuan et al., 2017).

Data and Implementation
We used the publicly available dataset of the ISIC 2018 skin lesion segmentation challenge (Tschandl et al., 2018; Codella et al., 2018). The lesion images were collected with a variety of dermatoscope types from several different institutions. Each image contained exactly one primary lesion, while smaller secondary lesions, other pigmented regions, and fiducial markers could be neglected. The released training set consisted of 2594 images with corresponding ground truth masks annotated by human experts. We randomly split them into 2000 images for training, 294 for validation, and 300 for testing, and resized all images to a consistent size of 192×192.
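The random 2000/294/300 split described above can be sketched as follows; this is a minimal illustration assuming the 2594 cases are identified by indices (the `seed` is a hypothetical choice for reproducibility, not stated in the text):

```python
import random

def split_dataset(image_ids, seed=0):
    """Randomly split the 2594 ISIC cases into 2000/294/300
    subsets for training/validation/testing."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return ids[:2000], ids[2000:2294], ids[2294:2594]

train_ids, val_ids, test_ids = split_dataset(range(2594))
```

Each image and its ground truth mask would then be resized to 192×192 before being fed to the network.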
For experiments, we used the 2D U-Net (Ronneberger et al., 2015) and the Dense U-Net (Guan et al., 2018), an extension of U-Net with dense connection blocks. The networks were implemented in TensorFlow (Abadi et al., 2016) using NiftyNet (Li et al., 2017; Gibson et al., 2018). During training, we used Adaptive Moment Estimation (Adam) to adjust the learning rate, initialized to 10^-3, with batch size 10, weight decay 10^-7, and 20k iterations. We represented the transformation parameter β in the proposed augmentation framework as a combination of fl, r and s, where fl is a random variable for flipping in 2D, r is the 2D rotation angle, and s is a scaling factor. The prior distributions of these parameters were modeled as fl ∼ Bern(0.5), r ∼ U(0, 2π), s ∼ U(0.8, 1.2), and the intensity noise as e ∼ N(0, 0.05). We used data augmentation at both training and test time based on this formulation.

Fig. 1 shows a visual comparison of the different types of uncertainty for skin lesion segmentation (odd columns: segmentation and ground truth; even columns: uncertainty maps). The results were based on the same trained Dense U-Net model, and the Monte Carlo simulation number N was 40 for TTD, TTA, and TTA + TTD, yielding epistemic, aleatoric, and hybrid uncertainties respectively. The subfigures show three cases with different skin lesion sizes and appearances. In Fig. 1(a), the first row presents the input and the segmentation obtained by the single-prediction baseline. The other rows show the three types of uncertainty and their corresponding segmentation results. It can be observed that the TTD-based epistemic uncertainty map mainly highlights the border of the segmented foreground. In contrast, the TTA-based aleatoric uncertainty map shows uncertain segmentations not only on the border but also in some challenging areas in the upper left corner of the image. Both the TTA-based aleatoric and the hybrid uncertainty maps perform better at indicating potential mis-segmentations than the TTD-based epistemic uncertainty.
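The Monte Carlo TTA procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: it samples the flip fl ∼ Bern(0.5), rotation r ∼ U(0, 2π) and noise e ∼ N(0, 0.05) priors (the scaling factor s is omitted for brevity), `predict` is a hypothetical placeholder for the trained network's foreground-probability output, and the pixel-wise entropy of the mean prediction is used as the aleatoric uncertainty map:

```python
import numpy as np
from scipy.ndimage import rotate

def tta_uncertainty(image, predict, n=40, rng=None):
    """Monte Carlo test-time augmentation (flip, rotation, additive
    Gaussian noise). `predict` maps a 2D image to a foreground
    probability map of the same shape."""
    rng = np.random.default_rng(rng)
    probs = []
    for _ in range(n):
        fl = rng.random() < 0.5                  # fl ~ Bern(0.5)
        r = rng.uniform(0.0, 360.0)              # r ~ U(0, 2*pi), in degrees
        e = rng.normal(0.0, 0.05, image.shape)   # e ~ N(0, 0.05)
        x = np.fliplr(image) if fl else image
        x = rotate(x, r, reshape=False, order=1, mode='nearest') + e
        p = predict(x)
        # invert the spatial transform so all predictions are aligned
        p = rotate(p, -r, reshape=False, order=1, mode='nearest')
        if fl:
            p = np.fliplr(p)
        probs.append(p)
    mean = np.clip(np.stack(probs).mean(axis=0), 1e-7, 1 - 1e-7)
    # pixel-wise entropy of the mean prediction as aleatoric uncertainty
    entropy = -(mean * np.log(mean) + (1 - mean) * np.log(1 - mean))
    return mean, entropy
```

TTD would follow the same loop but sample dropout masks inside the network instead of input transformations, and TTA + TTD combines both sources of randomness in each forward pass.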

Quantitative Evaluation
To quantitatively evaluate the segmentation results, we measured the Dice score and ASSD of each prediction for the different testing methods: baseline single prediction, TTD, TTA, and TTA + TTD. We also compared training with and without data augmentation. We found that the Monte Carlo sample number N at which performance reached a plateau was 40. Table 1 shows the quantitative evaluation results for these different testing methods when N was 40. For both networks, TTA led to a higher improvement in segmentation accuracy than TTD.

Table 1: Dice (%) and ASSD (pixels) evaluation of 2D skin lesion segmentation by different training and testing methods. Tr-Aug: training without data augmentation. Tr+Aug: training with data augmentation. * denotes significant improvement over the baseline of single prediction in Tr-Aug and Tr+Aug respectively (p-value < 0.05). † denotes significant improvement over Tr-Aug with TTA + TTD (p-value < 0.05).
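The two evaluation metrics can be computed as below; this is a standard sketch (not the authors' code), with ASSD obtained from Euclidean distance transforms of the two mask surfaces:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice_score(seg, gt):
    """Dice overlap between two binary masks."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    inter = np.logical_and(seg, gt).sum()
    return 2.0 * inter / (seg.sum() + gt.sum())

def assd(seg, gt):
    """Average symmetric surface distance (in pixels) between
    two binary masks."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    surf_s = seg & ~binary_erosion(seg)        # boundary of the prediction
    surf_g = gt & ~binary_erosion(gt)          # boundary of the ground truth
    d_to_g = distance_transform_edt(~surf_g)   # distance to gt surface
    d_to_s = distance_transform_edt(~surf_s)   # distance to seg surface
    dists = np.concatenate([d_to_g[surf_s], d_to_s[surf_g]])
    return dists.mean()
```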

Correlation between Uncertainty and Segmentation Error
We also investigated the correlation between prediction uncertainty and segmentation error. For pixel-level evaluation, we measured the joint histogram of pixel-wise uncertainty and pixel-wise error rate for TTD, TTA, and TTA + TTD, and normalized the joint histograms by the overall number of pixels in the test images. Fig. 2 shows the results based on the Dense U-Net trained with data augmentation and N set to 40. For each type of uncertainty, we calculated the average error rate at each uncertainty level, obtaining a curve of error rate as a function of uncertainty, i.e., the red curves in Fig. 2. The figure shows that as the uncertainty increases, the error rate gradually becomes higher as well. The curves in Fig. 2(b) and Fig. 2(c) have higher slopes than that in Fig. 2(a), showing that TTA produces fewer overconfident incorrect predictions than TTD and correlates better with mis-segmentations.
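The error-rate-versus-uncertainty curve described above can be sketched as follows; a minimal illustration assuming flattened per-pixel arrays, where `error` marks mis-segmented pixels with 1 (the bin count `n_bins` is a hypothetical choice):

```python
import numpy as np

def error_rate_vs_uncertainty(uncertainty, error, n_bins=20):
    """Average pixel-wise error rate at each uncertainty level,
    i.e. a curve over the joint histogram of uncertainty vs. error."""
    u = np.asarray(uncertainty, float).ravel()
    e = np.asarray(error, float).ravel()
    edges = np.linspace(u.min(), u.max(), n_bins + 1)
    bins = np.clip(np.digitize(u, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    rates = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rates[b] = e[mask].mean()   # mean error rate in this bin
    return centers, rates
```

A steeper curve means that high uncertainty is more predictive of error, i.e. fewer overconfident incorrect predictions.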
For structure-level evaluation, we measured structure-level uncertainty, represented by the volume variation coefficient (VVC), and structure-level error, represented by 1 − Dice. Fig. 3 shows their joint distributions for the three different testing methods using the 2D Dense U-Net trained with data augmentation, with the Monte Carlo sample number set to 40. The figure shows that for all three testing methods, the structure-level uncertainty increases with 1 − Dice. However, TTA-based testing has a larger slope than TTD-based testing, as shown in Fig. 3(a) and (b). TTA + TTD obtained results similar to TTA, as shown in Fig. 3(c).