Anomaly Detection-Inspired Few-Shot Medical Image Segmentation Through Self-Supervision With Supervoxels

Recent work has shown that label-efficient few-shot learning through self-supervision can achieve promising medical image segmentation results. However, few-shot segmentation models typically rely on prototype representations of the semantic classes, resulting in a loss of local information that can degrade performance. This is particularly problematic for the typically large and highly heterogeneous background class in medical image segmentation problems. Previous works have attempted to address this issue by learning additional prototypes for each class, but since the prototypes are based on a limited number of slices, we argue that this ad-hoc solution is insufficient to capture the background properties. Motivated by this, and the observation that the foreground class (e.g., one organ) is relatively homogeneous, we propose a novel anomaly detection-inspired approach to few-shot medical image segmentation in which we refrain from modeling the background explicitly. Instead, we rely solely on a single foreground prototype to compute anomaly scores for all query pixels. The segmentation is then performed by thresholding these anomaly scores using a learned threshold. Assisted by a novel self-supervision task that exploits the 3D structure of medical images through supervoxels, our proposed anomaly detection-inspired few-shot medical image segmentation model outperforms previous state-of-the-art approaches on two representative MRI datasets for the tasks of abdominal organ segmentation and cardiac segmentation.


Introduction
Many applications in medical image analysis, such as diagnosis (Tsochatzidis et al., 2021), treatment planning (Chen et al., 2021), and quantification of tissue volumes (Abdeltawab et al., 2020), rely heavily on semantic segmentation. To lessen the burden on the medical practitioners performing these manual, slice-by-slice segmentations, deep learning for automatic segmentation has great potential. Unfortunately, existing segmentation frameworks (Ronneberger et al., 2015; Li et al., 2018; Isensee et al., 2021) depend on supervised training and large amounts of densely labeled data, which are often unavailable in the medical domain. Moreover, their generalization to previously unseen classes is typically poor, necessitating the collection and labeling of new data to re-train for new tasks. Given the huge number of potential segmentation tasks in medical imaging, this makes these models impractical to use.
Inspired by how humans learn from only a handful of instances, few-shot learning has emerged as a learning paradigm to foster models that can easily adapt to new concepts when exposed to just a few new, labeled samples. These models typically follow an episodic framework (Vinyals et al., 2016) where, in each episode, k labeled samples, called the support set, are used to segment the unlabeled query image(s). The models are trained on one set of classes and learn to segment objects from new classes given only a few annotated examples. A trained few-shot segmentation (FSS) model is thus able to segment an unseen organ class based on just a few labeled instances. However, in order to avoid over-fitting, typical FSS models rely on training data containing a large set of labeled training classes, generally not available in the medical domain.
In a recent work, Ouyang et al. (2020) proposed a label-efficient approach to medical image segmentation, building on metric-learning based prototypical FSS (Liu et al., 2020b; Wang et al., 2019). They suggest a model that follows the traditional few-shot episodic framework, where class-wise prototypes are extracted from the labeled support set and used to reduce the segmentation of the unlabeled query image to a pixel-wise prototype matching in the embedding space. Whereas traditional few-shot learning models require a set of annotated training classes, Ouyang et al. (2020) propose a clever way to bypass this need by employing self-supervised training (Jing and Tian, 2020). Instead of sampling labeled support and query images, they construct the training episodes based on one unlabeled image slice and its corresponding superpixel (Ren and Malik, 2003) segmentation: one randomly sampled superpixel serves as foreground mask, and together with the original image slice, these form the support image-label pair. The query pair is then constructed by applying random transformations to the support pair. In this way, they enable training of the network without annotations, i.e., the model is trained unsupervised. In the inference phase, only a few labeled image slices are then needed to segment new classes. However, a general problem with prototypical FSS is the loss of local information caused by average pooling of features during prototype extraction. This is particularly problematic for spatially heterogeneous classes like the background class in medical image segmentation problems, which can contain any semantic class other than the foreground class. Previous metric-learning based works have addressed this issue by computing additional prototypes per class to capture more diverse features: Liu et al. (2020b) clustered the features within each class to obtain part-aware prototypes, and in the current state-of-the-art method, Ouyang et al. (2020) computed additional local prototypes on a regular grid.
We argue that it is insufficient to model the entire background volume with prototypes estimated from a few support slices and propose a conceptually different approach where we do not increase the number of background prototypes but remove the need for them altogether. Inspired by the anomaly detection literature (Chandola et al., 2009; Ruff et al., 2021), we propose to model only the relatively homogeneous foreground class with a single prototype and introduce an anomaly score that measures the dissimilarity between this foreground prototype and all query pixels. Segmentation is then performed by thresholding the anomaly scores using a learned threshold that encourages compact foreground representations. For direct comparison of our novel anomaly detection-inspired few-shot medical image segmentation method to that of Ouyang et al. (2020) and other representative works, our baseline setup follows their approach, working with 2D image slices. Within the existing 2D setup, we, as an added contribution, propose a new self-supervision task by extending the superpixel-based self-supervision scheme by Ouyang et al. (2020) to 3D in order to utilize the volumetric nature of the data. As a natural extension, facilitated by the new self-supervision task, we further indicate potential benefits beyond this 2D setup by exploring a direct 3D treatment of the problem, employing a 3D convolutional neural network (CNN) as the embedding network.
By only explicitly modeling the foreground class, we argue that our proposed approach is more robust to background outside the support slices, compared to current state-of-the-art methods (Ouyang et al., 2020;Roy et al., 2020). To further illustrate this, we introduce a new evaluation protocol where we, based on labeled slices from the support image, segment the entire query image, thus being more exposed to background effects. Previous works, on the other hand, limit the evaluation of the query image only to the slices containing the class of interest. However, this approach requires additional weak labels in the form of information about the location of the class in the query image, which is unrealistic and cumbersome, especially in the medical setting.
In summary, the main contributions of this work are threefold. We propose: (1) a simple but effective anomaly detection-inspired approach to FSS that outperforms prior state-of-the-art methods and removes the need to learn a large number of prototypes; (2) a novel self-supervision task that exploits the 3D structural information in medical images within the 2D setup and indicates the potential of training 3D CNNs for direct volume segmentation; and (3) a new evaluation protocol for few-shot medical image segmentation that does not rely on weak labels and is therefore more applicable in practical scenarios.

Few-Shot Meta-learning
As opposed to classical supervised learning, which specializes a model to perform one specific task by optimizing over training samples, few-shot meta-learning optimizes over a set of training tasks, with the goal of obtaining a model that can quickly adapt to new, unseen tasks. There exist various approaches to few-shot learning, including i) learning to fine-tune (Finn et al., 2017; Ravi and Larochelle, 2017), ii) sequence based (Mishra et al., 2018; Santoro et al., 2016), and iii) metric-learning based approaches (Vinyals et al., 2016; Snell et al., 2017; Nguyen et al., 2020). Due to its simplicity and efficiency, the latter category has recently received a lot of attention, and the models relevant for this paper build on this principle. Vinyals et al. (2016) combined deep feature learning with non-parametric methods in the Matching Network, performing weighted nearest-neighbor classification in the embedding space. They proposed to train the model in episodes where a small labeled support set and an unlabeled query image are mapped to the query label, making the model able to adapt to unseen classes without the need for fine-tuning. Whereas the Matching Network only performed one-shot image classification, Snell et al. (2017) later proposed the Prototypical Network, which extended the problem to few-shot classification. Based on the idea that there exists an embedding space in which samples cluster around their class prototype representation, they proposed a simpler model with a shared encoder between the support and query set, and a nearest-neighbor prototype matching in the embedding space.

Few-Shot Semantic Segmentation
Few-shot semantic segmentation extends few-shot image classification (Vinyals et al., 2016; Snell et al., 2017; Nguyen et al., 2020) to pixel-level classification (Shaban et al., 2017; Rakelly et al., 2018; Zhang et al., 2020; Wang et al., 2019), and the goal is to segment one (or more) new class(es) in a new image based on a few densely labeled samples. A recent line of work builds on the ideas from the Prototypical Network by Snell et al. (2017) and can be roughly split into two groups: models where predictions are based directly on the cosine similarity between query features and prototypes in the embedding space (Wang et al., 2019; Liu et al., 2020b; Ouyang et al., 2020), and models that find the correlation between query features and prototypes by employing decoding networks to get the final prediction (Dong and Xing, 2018; Zhang et al., 2019; Liu et al., 2020a; Li et al., 2021a; Zhang et al., 2021; Tian et al., 2020). Dong and Xing (2018) first adopted the idea of metric-learning based prototypical networks to perform few-shot semantic segmentation. They proposed a two-branched model: a prototype learner, learning class-wise prototypes from the labeled support set, and a segmentation network where the prototypes are used to guide the segmentation of the query image. Most relevant for this work, Wang et al. (2019) argued that parametric segmentation generalizes poorly and proposed the Prototype Alignment Network (PANet), a simpler model where the knowledge extraction and segmentation processes are separated. By exploiting prototypes extracted from the semantic classes of the support set, they reduced the segmentation of the query image to a non-parametric pixel-wise nearest-neighbor prototype matching, thereby creating a new branch of FSS models. Building on PANet, Liu et al. (2020b) addressed the limitation of reducing semantic classes to a single prototype and proposed the Part-aware Prototype Network (PPNet), where each semantic class is represented by multiple prototypes to capture more diverse features. Liu et al. (2020b) further adopted a semantic branch for parametric segmentation during training to learn better representations. Ouyang et al. (2020) adapted ideas from PANet to perform FSS in the medical domain. They addressed the major factor preventing medical FSS, i.e., the dependency on a large set of annotated training classes. This barrier was overcome by the introduction of a superpixel-based self-supervised learning scheme, enabling the training of FSS networks without the need for labeled data. Ouyang et al. (2020) further introduced the Adaptive Local Prototype pooling empowered prototypical Network (ALPNet), where additional local prototypes are computed on a regular grid to preserve local information and enhance segmentation performance.
A different approach to medical FSS, and the first FSS model for medical image segmentation, was suggested by Roy et al. (2020). Their proposed SE-Net employs squeeze-and-excite blocks (Hu et al., 2018) in a two-armed architecture consisting of one conditioner arm, processing the support set, and one segmenter arm, interacting with the conditioner arm to segment the query image. However, this model is trained in a supervised manner, requiring a set of labeled classes for training.
Based on our experience, training a decoder in a self-supervised setting, where the training task (superpixel segmentation) differs from the inference task (organ segmentation), is challenging and leads to performance degradation. In this paper, we thus, partially inspired by the state-of-the-art model (Ouyang et al., 2020), build further on the branch initiated by Wang et al. (2019) to perform FSS in the medical domain. We propose a novel FSS model that, unlike previous approaches in this branch (Wang et al., 2019; Liu et al., 2020b; Ouyang et al., 2020), does not explicitly model the complex background class, but relies solely on one foreground prototype.

Self-Supervised Learning
When large labeled datasets are not available, self-supervision can be used to learn representations by training the deep learning model on an auxiliary task that is defined such that the label is implicitly available from the data. A good auxiliary task should require high-level image understanding to be solved, thereby encouraging the network to encode this type of information. Commonly used auxiliary tasks include image inpainting (Larsson et al., 2016; Pathak et al., 2016; Zhang et al., 2016), contrastive learning (Misra and Maaten, 2020), rotation prediction (Komodakis and Gidaris, 2018), solving jigsaw puzzles (Noroozi and Favaro, 2016), and relative patch location prediction (Doersch et al., 2015).
In the medical domain, self-supervised learning (SSL) has been used to improve performance on other (main) tasks by exploiting unlabeled data in a multi-task learning setting (Li et al., 2021b) and to pre-train models before transferring them to new (main) tasks (Bai et al., 2019; Zhu et al., 2020; Dong et al., 2021; Lu et al., 2021). In Ouyang et al. (2020), SSL was used to train an FSS model completely unsupervised using a novel superpixel-based auxiliary task, removing the need for labeled data during training. We build on this work by extending the proposed self-supervision scheme to 3D supervoxels.

Supervoxel Segmentation
Supervoxels and superpixels are groupings of local voxels/pixels in an image that share similar characteristics. The boundaries of a supervoxel/superpixel therefore tend to follow the boundaries of the structures in the image, providing natural sub-regions. Supervoxel and superpixel segmentation has become a common tool in computer vision, also in the medical domain (Irving et al., 2016). For a detailed comparison of available superpixel segmentation algorithms, we refer the reader to (Stutz et al., 2018).

Fig. 1. Illustration of the model during training. Support and query slices are obtained from the same image volume as two different 2D slices containing a randomly sampled supervoxel. A shared feature encoder encodes the query and the support images into deep feature maps. The support features are then resized to the mask size and masked average pooling is applied to compute the foreground prototype. For each query feature vector, an anomaly score is computed based on the cosine similarity to the prototype. Finally, the segmentation of the query image is performed by thresholding the anomaly scores using a learned anomaly threshold.

Fig. 2. Illustration of the model during inference. Based on labeled slices from the support volume, the query volume is segmented slice by slice, one class at a time.

Problem Definition
Given a labeled dataset with classes $\mathcal{C}_{train}$ (here: $\mathcal{C}_{train} = \{\text{supervoxel}_1, \text{supervoxel}_2, \dots\}$), FSS models aim to learn a quick adaptation to new classes $\mathcal{C}_{test}$ (e.g. $\mathcal{C}_{test} = \{\text{liver}, \text{kidney}, \text{spleen}\}$) when exposed to only a few labeled samples. The training and testing are performed in an episodic manner (Vinyals et al., 2016) where, in each episode, $N$ classes are sampled from $\mathcal{C}$ to create a support set and a query set. The input to an episode is the support image(s) (with annotations) and a query image, and the output is the predicted query mask. In an $N$-way $k$-shot setting, the support set $S = \{(x_1, y_1), \dots, (x_{N \times k}, y_{N \times k})\}$ consists of $k$ image slices $x \in \mathbb{R}^{H \times W}$ (with annotations $y \in \mathbb{R}^{H \times W}$ indicating the class of each pixel) from each of the $N$ classes, whereas the query set consists of one query image $Q = (x^*_1, y^*_1)$ containing one or more of the $N$ classes.

Methods
In this work, we propose an anomaly detection-inspired network (ADNet) for prototypical FSS. We employ a shared feature extractor between the support and query images and perform metric learning-based segmentation in the embedding space. Unlike prior approaches that obtain prototypes for both foreground and background classes (Liu et al., 2020b; Ouyang et al., 2020; Wang et al., 2019), we only consider foreground prototypes, to avoid the aforementioned problems related to explicitly modeling the large and heterogeneous background class. Based on one foreground prototype, we compute anomaly scores for all query feature vectors. The segmentation of the query image is then based on these anomaly scores and a learned anomaly threshold. To train our model, we take inspiration from Ouyang et al. (2020) and propose a new supervoxel-based self-supervision pipeline. Fig. 1 and Fig. 2 provide an overview of the model during training and inference, respectively.

Anomaly Detection-Inspired Few-Shot Segmentation
We denote the encoding network as $f_\theta$ and start by embedding the support and query images into deep features, $f_\theta(x) = F_s$ and $f_\theta(x^*) = F_q$, respectively. As opposed to previous works, we are only interested in explicitly modeling the foreground in each episode. We do this by employing the segmentation mask to perform masked average pooling (MAP), but only for the foreground class $c$. We resize the support feature map $F_s$ to the mask size $(H, W)$ and compute one foreground prototype $p \in \mathbb{R}^d$, where $d$ is the dimension of the embedding space:

$$p = \frac{\sum_{x, y} F_s(x, y) \odot y^{fg}(x, y)}{\sum_{x, y} y^{fg}(x, y)},$$

where $\odot$ denotes the Hadamard product and $y^{fg} = \mathbb{1}(y = c)$ is the binary foreground mask of class $c$ ($\mathbb{1}(\cdot)$ is the indicator function, returning 1 if the argument is true and 0 otherwise).
To segment the query image based on this one class prototype, we design a threshold-based metric learning approach to the segmentation. We first obtain an anomaly score $S$ for each query feature vector $F_q(x, y)$ by calculating the (negative) cosine similarity to the foreground prototype $p$ of the episode:

$$S(x, y) = -\alpha \cos\big(F_q(x, y),\, p\big),$$

where $\alpha = 20$ is a scaling factor introduced by Oreshkin et al. (2018). In this way, query feature vectors that are identical to the prototype get an anomaly score of $-\alpha$ (minimum), whereas query feature vectors that point in the opposite direction, relative to the prototype, get an anomaly score of $\alpha$ (maximum). The predicted foreground mask is then found by thresholding these anomaly scores with a learned parameter $T$.
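To make these two steps concrete, the following is a minimal PyTorch sketch of the masked average pooling and the anomaly scoring. Function and tensor names are illustrative and not taken from the authors' implementation; the bilinear resize mode and the small epsilon are assumptions.

```python
import torch
import torch.nn.functional as F

def foreground_prototype(support_feats, support_mask):
    """Masked average pooling (MAP) of the support features.

    support_feats: (d, h, w) support feature map F_s.
    support_mask:  (H, W) binary (float) foreground mask y_fg of class c.
    """
    # Resize the feature map to the mask size (H, W) before pooling.
    feats = F.interpolate(support_feats[None], size=support_mask.shape,
                          mode="bilinear", align_corners=True)[0]
    # Hadamard product with the mask, then average over foreground pixels.
    return (feats * support_mask[None]).sum(dim=(1, 2)) / (support_mask.sum() + 1e-5)

def anomaly_scores(query_feats, prototype, alpha=20.0):
    """Negative scaled cosine similarity S(x, y) = -alpha * cos(F_q(x, y), p).

    query_feats: (d, h, w) query feature map F_q; returns (h, w) scores
    ranging from -alpha (identical to p) to +alpha (opposite of p).
    """
    return -alpha * F.cosine_similarity(query_feats, prototype[:, None, None], dim=0)
```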
To make the process differentiable, we perform soft thresholding by applying a shifted Sigmoid:

$$\hat{y}^q_{fg}(x, y) = 1 - \sigma\big(S(x, y) - T\big),$$

where $\sigma(z) = \big(1 + e^{-\kappa z}\big)^{-1}$ denotes the Sigmoid function with a steepness parameter $\kappa = 0.5$. The impact of the steepness parameter is examined in Section 5.3.4. In this way, query feature vectors with an anomaly score below $T$ (similar to the prototype) get a foreground probability above 0.5, whereas query feature vectors with an anomaly score above $T$ (dissimilar to the prototype) get a foreground probability below 0.5. The predicted background mask is finally found as $\hat{y}^q_{bg} = 1 - \hat{y}^q_{fg}$. The predicted foreground and background masks for the query image are then upsampled to the image size $(H, W)$, and we compute the binary cross-entropy segmentation loss

$$\mathcal{L}_{seg} = -\frac{1}{HW} \sum_{x, y} \Big( y^q_{fg}(x, y) \log \hat{y}^q_{fg}(x, y) + y^q_{bg}(x, y) \log \hat{y}^q_{bg}(x, y) \Big).$$

In order to encourage a compact embedding of the foreground classes, we construct an additional loss term $\mathcal{L}_T = T / \alpha$ that minimizes the learned threshold. The effect of this loss component is examined in Section 5.3.2. Following common practice (Liu et al., 2020b; Ouyang et al., 2020; Wang et al., 2019), we also add a prototype alignment regularization loss $\mathcal{L}_{PAR}$, where the roles of support and query are reversed: the predicted query mask is used to compute a prototype that segments the support image. This gives us the overall loss function

$$\mathcal{L} = \mathcal{L}_{seg} + \mathcal{L}_T + \mathcal{L}_{PAR}.$$
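Continuing the sketch above, the soft thresholding and the loss composition could look as follows. The function name `adnet_loss` is hypothetical; the background down-weighting (1.0/0.1) anticipates the implementation details below, and $\mathcal{L}_{PAR}$ is passed in precomputed rather than implemented.

```python
import torch

def soft_threshold(scores, T, kappa=0.5):
    """Shifted Sigmoid with steepness kappa: y_fg = 1 - sigma(S - T).

    Scores below the learned threshold T give foreground probability > 0.5.
    """
    return torch.sigmoid(kappa * (T - scores))

def adnet_loss(fg_prob, y_fg, T, alpha=20.0, par_loss=0.0, w_fg=1.0, w_bg=0.1):
    """Overall loss L = L_seg + L_T + L_PAR (L_PAR supplied precomputed).

    fg_prob: (H, W) predicted foreground probabilities, upsampled to image size.
    y_fg:    (H, W) binary ground-truth (pseudo-)mask for the query.
    """
    eps = 1e-7
    l_seg = -(w_fg * y_fg * torch.log(fg_prob + eps)
              + w_bg * (1.0 - y_fg) * torch.log(1.0 - fg_prob + eps)).mean()
    l_t = T / alpha  # encourages a compact foreground embedding
    return l_seg + l_t + par_loss
```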

Supervoxel-Based Self-Supervision
The ADNet is parameterized by $P = \{\theta, T\}$ and trained self-supervised (unsupervised) end-to-end in an episodic manner. For ease of comparison to previous approaches, our baseline setup follows a 2D approach, where volumes are segmented slice by slice. However, to better utilize the volumetric nature of the medical images, we propose a new self-supervision task that exploits 3D supervoxels during the model's training phase. As supervoxels are sub-volumes of the image, representing groups of similar voxels in local regions of the image volume, this allows us to sample 3D pseudo-segmentation masks for semantically uniform regions in the image.
In the training phase, each episode is constructed based on one unlabeled image volume and its supervoxel segmentation: First, one random supervoxel is sampled to represent the foreground class, resulting in a binary 3D segmentation mask. Then, we sample two 2D slices from the image containing this "class"/supervoxel to serve as support and query images. By exploiting the relations across slices, we are able to increase the amount of information that can be extracted in the self-supervision task compared to prior approaches. Following Ouyang et al. (2020), we additionally apply random transformations to one of the images (query or support) to encourage invariance to shape and intensity differences.
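Sketched below is how such a training episode could be assembled. The names, the sampling details, and the 200-pixel minimum (stated under the implementation details) are illustrative assumptions, and the sketch assumes the sampled supervoxel spans at least two valid slices.

```python
import numpy as np

def sample_episode(volume, supervoxels, min_fg=200, rng=np.random):
    """Sketch of one self-supervised training episode (names illustrative).

    volume:      (Z, H, W) unlabeled image volume.
    supervoxels: (Z, H, W) integer supervoxel labels, precomputed offline.
    """
    # 1. Sample one supervoxel to act as the pseudo foreground class.
    sv = rng.choice(np.unique(supervoxels))
    mask3d = supervoxels == sv
    # 2. Keep only slices where the supervoxel covers enough pixels (cf. the
    #    200-pixel minimum stated under the implementation details).
    valid = [z for z in range(volume.shape[0]) if mask3d[z].sum() >= min_fg]
    z_s, z_q = rng.choice(valid, size=2, replace=False)
    support = (volume[z_s], mask3d[z_s].astype(np.float32))
    query = (volume[z_q], mask3d[z_q].astype(np.float32))
    # 3. Random geometric/intensity transforms would be applied to one of the
    #    two pairs here to encourage invariance (omitted for brevity).
    return support, query
```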
The supervoxels for all image volumes are computed offline using a 3D extension of the same unsupervised segmentation algorithm (Felzenszwalb and Huttenlocher, 2004) as in Ouyang et al. (2020). This is an efficient graph-based image segmentation algorithm building on Euclidean distances between neighboring pixels. In the 3D extension, this corresponds to the distances from each voxel to its 26 nearest neighbors. In medical images, the resolution in the z-direction (slice thickness) is typically different from the in-plane (x, y) resolution. To account for this anisotropic voxel resolution, we re-weight all distances with a component along the z-direction (the xz-, yz-, and xyz-directions) according to the spatial ratios.
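As a sketch of the anisotropy handling, the 26-neighborhood offsets and one plausible re-weighting (scaling each edge by its physical length relative to its isotropic grid length; the authors' exact weighting may differ) could be generated as:

```python
import itertools
import numpy as np

def neighbor_offsets_and_scales(z_ratio):
    """26-neighborhood offsets with anisotropy-corrected distance scaling.

    z_ratio: slice spacing divided by in-plane pixel spacing. In-plane
    edges keep scale 1; edges with a z-component are stretched accordingly.
    """
    offsets, scales = [], []
    for dz, dy, dx in itertools.product((-1, 0, 1), repeat=3):
        if (dz, dy, dx) == (0, 0, 0):
            continue
        iso = np.sqrt(dz ** 2 + dy ** 2 + dx ** 2)               # grid length
        phys = np.sqrt((dz * z_ratio) ** 2 + dy ** 2 + dx ** 2)  # physical length
        offsets.append((dz, dy, dx))
        scales.append(phys / iso)
    return offsets, scales
```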
The supervoxel generation has one hyper-parameter ρ that controls the minimum supervoxel size, where a larger ρ corresponds to larger and fewer supervoxels. The effect of this parameter on the final segmentation result is examined in Section 5.3.3.

Implementation Details
The implementation is based on the PyTorch (v1.7.1) implementation of SSL-ALPNet (Ouyang et al., 2020). The encoder network used in all the 2D experiments is a ResNet-101 pretrained on MS-COCO, where the classifier is replaced by a 1 × 1 convolutional layer to reduce the feature dimension from 2048 to 256. Following ALPNet, we optimize the loss using stochastic gradient descent with momentum 0.9, a learning rate of 1e-3 with a decay rate of 0.98 per 1k epochs, and a weight decay of 5e-4 over 50k iterations. To address the class imbalance, we follow previous work and weigh the foreground and background classes in the cross-entropy loss (1.0 and 0.1, respectively). To further stabilize training, we set a minimum threshold of 200 pixels on the supervoxel size in the slices sampled as support/query. Supervoxel generation is done offline (once per image volume) and is relatively computationally efficient. Training takes 1.8 h on an Nvidia RTX 2080 Ti GPU.
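As a sketch, the stated optimization setup could be configured as follows in PyTorch; `model` is a stand-in for the encoder plus the learned threshold, and applying the decay per 1k optimizer steps (the text says per 1k epochs) is an assumption.

```python
import torch

def configure_optimization(model):
    """Sketch of the stated training setup (SGD + stepwise lr decay)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=5e-4)
    # Multiply the learning rate by 0.98 every 1k steps over 50k iterations.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                step_size=1000, gamma=0.98)
    return optimizer, scheduler
```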
Data

The models are evaluated on two representative MRI datasets: (1) MS-CMRSeg (bSSFP fold), from the MICCAI 2019 Multi-sequence Cardiac MR Segmentation Challenge, containing 35 3D cardiac MRI scans. (2) CHAOS, from the ISBI 2019 Combined Healthy Abdominal Organ Segmentation Challenge (task 5), containing 20 3D T2-SPIR MRI scans with on average 36 slices (Kavur et al., 2019, 2020, 2021). To compare our results to Ouyang et al. (2020), we follow the same pre-processing scheme: 1) cut the top 0.5% intensities; 2) re-sample image slices (short-axis slices for the cardiac images and axial slices for the abdominal images) to the same spatial resolution; 3) crop slices to a unified size (256 × 256 pixels). Further, to fit the pretrained network, each slice is repeated three times along the channel dimension.
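A minimal sketch of this pre-processing, assuming a (Z, H, W) volume that is at least 256 × 256 in-plane after resampling (the resampling step itself depends on the scan's voxel spacing and is omitted; the center crop is an assumption):

```python
import numpy as np

def preprocess(volume, crop=256):
    """Sketch of the three stated pre-processing steps on a (Z, H, W) volume."""
    # 1. Cut the top 0.5% intensities.
    volume = np.clip(volume, None, np.percentile(volume, 99.5))
    # 2. (Re-sampling to a common spatial resolution -- not shown.)
    # 3. Crop each slice to 256 x 256 pixels (center crop assumed).
    H, W = volume.shape[-2:]
    top, left = (H - crop) // 2, (W - crop) // 2
    volume = volume[..., top:top + crop, left:left + crop]
    # Repeat each slice 3x along a channel axis to fit the pretrained encoder.
    return np.repeat(volume[:, None], 3, axis=1)  # (Z, 3, crop, crop)
```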
In all experiments, the models are trained self-supervised (unsupervised) and evaluated in a five-fold cross-validation manner, where, in each fold, the support images are sampled from one of the patients and the remaining patients are treated as query (see Fig. 3). Furthermore, to account for the stochasticity in the model and optimization, we repeat each fold three times. In the cardiac MRI scans we segment three classes: left-ventricle blood pool (LV-BP), left-ventricle myocardium (LV-MYO), and right ventricle (RV). In the abdominal MRI scans, we segment four classes: left kidney (L. kid.), right kidney (R. kid.), liver, and spleen. Following previous methods (Ouyang et al., 2020; Roy et al., 2020), each class is segmented separately in binary foreground/background segmentation problems. Since the models are trained self-supervised, we do not exclude image slices that contain the target classes.

Fig. 3. Setup for the five-fold cross-validation. This illustrates how the patient IDs are distributed among the splits and how the support/query volumes are selected for the cardiac MRI dataset. For each fold, a model is trained on all images not present in that fold. During inference, the left-out fold is used exclusively, where the labeled support image is exploited to segment the query images slice by slice, class by class. The CHAOS dataset is split into five folds in a similar manner.

Fig. 4. Illustration of EP1 (top) and EP2 (bottom). In EP1, the support and query volumes are divided into three succeeding sub-chunks. The middle slice in each sub-chunk of the support volume is labeled and used to segment all the slices in the corresponding sub-chunk in the query volume. This means that the protocol requires weak labels indicating where the class of interest is located in the query volume. In EP2, the middle slice of the support volume is labeled and used to segment all slices in the query volume, avoiding the need for additional weak labels.

Evaluation metric
Following common practice (Ouyang et al., 2020; Roy et al., 2020), we employ the mean dice score to compare the model predictions to the ground-truth segmentations. The dice score, $D$, between two segmentations $A$ and $B$ is given by

$$D(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \cdot 100\%,$$

meaning that a dice score of 100% corresponds to a perfect match between the segmentations.
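For reference, a direct NumPy implementation of this metric on binary masks might look as follows (the convention for two empty masks is an assumption):

```python
import numpy as np

def dice(a, b):
    """Dice score in percent between two binary segmentation masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 100.0 * 2.0 * np.logical_and(a, b).sum() / denom if denom else 100.0
```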

Evaluation protocols
During inference, the query volumes are segmented episode-wise, slice by slice, based on labeled support slices. For this reason, it is necessary to define an evaluation protocol that describes how to construct the episodes during inference, i.e., how to pair support and query images in episodes. In the experiments, we evaluate all models under two different evaluation protocols (EPs), illustrated in Fig. 4.
Evaluation protocol 1 (EP1). Previous works (Ouyang et al., 2020; Roy et al., 2020) follow an evaluation protocol that requires weak labels for all query images, i.e., there is a need to indicate (label) in which slices the foreground class is located. For a given class to be segmented, the chunk of slices in both the support and query volumes containing this class is divided into three succeeding sub-chunks. The middle slice in each sub-chunk of the support volume is used to segment all the slices in the corresponding sub-chunk in the query. In practice, this requires manual and time-consuming input from medical experts during the inference phase, where they have to scroll through each query image volume to mark the slices containing the class(es) of interest.
Evaluation protocol 2 (EP2). To avoid the need for weak query labels during inference, we introduce a new evaluation protocol that does not depend on the position of the target volume and is thus more applicable in practical situations. Here, we simply sample a single slice (k = 1) from the support foreground volume and use this information to segment the entire query volume. To limit boundary effects, we choose the middle slice of the support foreground volume.
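In code, selecting the single EP2 support slice could be as simple as the following sketch (names are illustrative):

```python
import numpy as np

def ep2_support_slice(support_labels, cls):
    """EP2 sketch: index of the middle slice among those containing `cls`.

    support_labels: (Z, H, W) label volume for the support patient.
    """
    fg = [z for z, sl in enumerate(support_labels) if np.any(sl == cls)]
    return fg[len(fg) // 2]
```

The query volume is then segmented slice by slice using only this single labeled support slice.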

Comparison to state-of-the-art
We compare our model to three modern FSS models: PANet (Wang et al., 2019), ALPNet (Ouyang et al., 2020), and PPNet (Liu et al., 2020b) with five (default) prototypes per class. Additionally, to compare our one-prototype anomaly approach to a one-prototype decoder approach, we adopt the dense comparison module proposed in (Zhang et al., 2019) as a decoder on top of the backbone network and refer to this network as CANet. The current state-of-the-art method for medical FSS, Ouyang et al. (2020), showed that training PANet and ALPNet in a self-supervised manner improved the dice scores of the segmentation results considerably, compared to classical supervised FSS. Specifically, the dice scores on the MS-CMRSeg and CHAOS datasets increased by an average of 17.9 and 26.1 percentage points, respectively. Here, we thus focus only on SSL approaches. pSSL refers to the superpixel SSL approach presented in Ouyang et al. (2020), whereas vSSL refers to our proposed supervoxel-based approach. Table 1 and Table 2 present the results under EP1 and EP2, respectively, as means and standard deviations over three runs (over all splits). Summarized details about the models can be found in Table 3.

Table 3. Summarized information about the models. *The number of prototypes in ALPNet is adaptive and we report the average number over all classes during inference.
In Table 1 we can see that our proposed model under EP1 performs similarly to the state-of-the-art on both datasets, while using significantly fewer prototypes compared to the closest competitors. We can also observe that the models that use just a few prototypes to model the background (PANet, PPNet) perform poorly and are among the three worst performing models for both datasets. Furthermore, by only modeling the foreground class and segmenting the query image using a decoding network, CANet results in the lowest (overall) dice score on the cardiac dataset.
In a more realistic scenario, information about the location of the foreground volume in the query images is typically not available. We therefore evaluate the models under EP2 (Table 2) and observe that our proposed approach outperforms the state-of-the-art. One-sided Wilcoxon signed-rank tests (Wilcoxon, 1992) on the mean dice scores across all runs indicate a significant difference between the segmentation results obtained from vSSL-ADNet and pSSL-ALPNet for both datasets under EP2 (p < 0.05). For the abdominal data, our model improves the segmentation results by more than 20 percentage points compared to pSSL-ALPNet. The main reason for this large improvement is that we now have to consider all the query slices (not only the slices containing the organ to be segmented), meaning that the background class is much larger and much more diverse. This further complicates the task of modeling the background with prototypes, whereas our anomaly detection-inspired model without background prototypes is less affected. The somewhat lower performance and high standard deviation for left kidney and spleen are related to the weak boundaries between these organs (see discussion in Section 6). Furthermore, we obtain considerable, but smaller, improvements on the cardiac dataset under EP2. This is related to the lower number of slices and the less diverse background in these images, making the task of modeling the background with prototypes less complicated. Qualitative comparisons are provided in Fig. 5 and Fig. 6, where we can see that our approach is less prone to over-segmentation.

Model analysis

Analysis of learned threshold
To evaluate the learned threshold's precision on the unseen test data, we have conducted a line search where we, in the inference phase, evaluate the dice score obtained using a range of different thresholds between -20 and -15. The experiment was performed on three runs for each split, and the mean dice score and standard deviation (shaded region) are reported in Fig. 7. The learned threshold is averaged over all runs and represented by the vertical black line. From the plot, we see that the threshold optimized for the training data is close to the ideal threshold for the test data, with little to gain in terms of increased dice score.
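The line search itself is straightforward; a sketch, where `dice_at_threshold` is a hypothetical callable that runs inference with a fixed threshold and returns the mean dice score:

```python
import numpy as np

def threshold_line_search(dice_at_threshold, lo=-20.0, hi=-15.0, steps=26):
    """Evaluate the dice score over a grid of fixed inference thresholds."""
    grid = np.linspace(lo, hi, steps)
    scores = [dice_at_threshold(t) for t in grid]
    return grid[int(np.argmax(scores))], scores
```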

Ablation study
To evaluate the effect of the three components of our loss function, we conduct an ablation study on the cardiac dataset. Table 4 shows that $\mathcal{L}_T$ and $\mathcal{L}_{PAR}$ improve the dice score across all classes. Further, Fig. 8 qualitatively shows the effect of $\mathcal{L}_T$ on the segmentation of one image slice from the MS-CMRSeg dataset. Here, it can be seen how encouraging a more compact foreground embedding via $\mathcal{L}_T$ reduces the over-segmentation, especially for the left-ventricle myocardium.

Sensitivity of supervoxel size
A sensitivity analysis of the parameter ρ, controlling the supervoxel size, is conducted on the MS-CMRSeg dataset and the results are presented in Table 5. As shown by these results, the final segmentation performance is relatively robust for a range of minimum size values from ρ = 1000 to ρ = 2000. However, if we allow the sizes to become too small (ρ = 500) or too large (ρ = 5000), we see that the performance is negatively affected. Examples of 2D slices from the 3D supervoxel segmentations for the different values of ρ are shown in Fig. 9.
According to the sensitivity study, a reasonable value is ρ = 1000, and all the reported vSSL results are obtained with this value for the MS-CMRSeg dataset and ρ = 5000 for the CHAOS dataset, unless otherwise stated. The difference in value of ρ reflects the differences in volume size.

Influence of steepness parameter
The steepness of the Sigmoid function controls how soft the performed threshold operation is. If the steepness is high (harder thresholding), the class assignments of samples become harder, also close to the threshold. To examine the influence of the steepness parameter, $\kappa$, on the final segmentation results, we have conducted six experiments with different values of $\kappa$, from $\kappa = 0.1$ to $\kappa = 1.0$, on the MS-CMRSeg dataset. The results presented in Table 6 indicate that the model is relatively robust with respect to this parameter, although we observe a gain of more than two percentage points in the dice score by decreasing the steepness from 1.0 to 0.5.

Fig. 9. Examples of supervoxel segmentation results in one slice from the MS-CMRSeg dataset for different values of ρ. The parameter ρ controls the minimum size of a supervoxel for it not to be joined with an adjacent supervoxel. A larger ρ corresponds to larger and fewer supervoxels.

vSSL vs pSSL
To disentangle and isolate the effect of the proposed extension of the self-supervision task, we have conducted additional experiments where we train our proposed model (ADNet) and the closest competing model (ALPNet) with the two different self-supervision tasks. From the results in Table 7, we see that the supervoxels overall yield better or comparable results for both models. For our proposed ADNet, there is a significant improvement (p < 0.05) in dice score from pSSL to vSSL for both datasets. Moreover, the improvements appear most prominent for the abdominal dataset, which we assume is related to the nature of the image volumes: compared to the cardiac dataset, the abdominal image volumes contain more slices and thus more potential information to utilize when the self-supervision task is extended to 3D.
A different implication of the proposed extension to supervoxel-based self-supervision is the enabling of training 3D CNNs for direct volume segmentation, as discussed in the next section.

Extension to one-step volume segmentation
Thus far, we have adopted a hybrid strategy to 3D segmentation, following Ouyang et al. (2020), where the 3D image volumes are segmented slice by slice, independently. However, a natural extension that is facilitated by the new self-supervision task is to adopt a 3D CNN as backbone to process the volumes in one step, thereby fully exploiting the potentially useful information along the third axis. Unfortunately, the high memory consumption and computational cost of 3D CNNs have limited their use to smaller images (in number of voxels), often obtained by down-sampling the original images (Çiçek et al., 2016) or by patch-based approaches (Huo et al., 2019).
To investigate the potential of utilizing 3D convolutions to do one-step 3D segmentations within our proposed framework, we employ a 3D ResNeXt-101 (Hara et al., 2018), which is the 3D extension of ResNeXt (Xie et al., 2017), pretrained on the Kinetics-600 dataset (Kay et al., 2017), as our encoder network. The 3D ResNeXt-101 is a more resource-efficient network, compared to the 3D ResNet-101, with approximately half the number of parameters (see Table 8).

Table 7. Mean dice score and standard deviation over three runs per split for ADNet and ALPNet with superpixel-based and supervoxel-based self-supervision. * indicates that the increase in mean dice score for the best performing model is statistically significant (p < 0.05).
To retain the same spatial resolution in the embedding space as for our 2D backbone, we modify the network by i) removing the max pooling in the z-direction and ii) changing the strides in conv3, conv4, and conv5 to (1, 2, 2), (1, 1, 1), and (1, 1, 1), respectively (see architecture details in Table 9). Similarly to the 2D ResNet-101, we replace the classifier with 1 × 1 × 1 convolutions to reduce the feature dimension from 2048 to 256. Each voxel is repeated three times along the channel dimension in the input to fit the pretrained network. The network is trained self-supervised end-to-end on 3D patches of size (10, 215, 215), and the loss is optimized according to Section 4.3. During inference, we evaluate the performance under EP2 with two different levels of supervision: i) only labeling the middle slice of the target class in the support volume (k = 1), as is done in the 2D experiments; ii) labeling all the support slices (k = all) and computing one prototype for the entire support volume, which is enabled by the volume-wise embedding. Table 8 provides a summary of the performance of vSSL-ADNet with 3D ResNeXt-101 and 2D ResNet-101 backbones. Though it is difficult to directly compare 2D and 3D CNNs for many reasons, such as differences in pretraining datasets and in the number of weights modeling relations within and between slices, the results are meant to indicate the potential of using 3D convolutions in our framework to perform one-step 3D segmentation.
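A sketch of this backbone surgery is given below; the attribute names (`maxpool`, `layer2`-`layer4` for conv3-conv5) follow a typical 3D ResNet/ResNeXt implementation and are assumptions, not the authors' code.

```python
import torch.nn as nn

def adapt_3d_backbone(net):
    """Sketch of the described modifications to a 3D ResNeXt-101."""
    # i) Remove the max pooling along z in the stem.
    net.maxpool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                               padding=(0, 1, 1))
    # ii) Strides (1, 2, 2), (1, 1, 1), (1, 1, 1) for conv3/conv4/conv5:
    #     only the first block of each stage performs the downsampling.
    for stage, stride in [(net.layer2, (1, 2, 2)),
                          (net.layer3, (1, 1, 1)),
                          (net.layer4, (1, 1, 1))]:
        for m in stage[0].modules():
            if isinstance(m, nn.Conv3d) and m.stride == (2, 2, 2):
                m.stride = stride
    # The classification head is dropped; a 1x1x1 conv projects the conv5
    # features from 2048 to 256 channels, mirroring the 2D setup.
    projection = nn.Conv3d(2048, 256, kernel_size=1)
    return net, projection
```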
From the results on the cardiac dataset, we see that the differences between 2D and 3D are relatively small, which agrees with observations in previous work (Vesal et al., 2019). In the abdominal dataset, on the other hand, there appears to be a greater potential for utilizing the 3D structure via 3D convolutions. This mirrors our results from Section 5.3.5, where we found that the abdominal dataset benefited more from extending the self-supervision task from superpixels to supervoxels.
The largest performance difference between the backbones can be observed for the left kidney and spleen classes. While the 2D CNN results in a segmentation where these classes are confused, the 3D CNN leads to a better separation between the classes, as illustrated in Fig. 10. We further observe a drop in performance on the right kidney class for the 3D CNN with k = 1, which demonstrates the importance of having good support features to achieve robust results with the 3D backbone.

Limitations and Outlook
The key observation leading to our anomaly detection-inspired few-shot medical image segmentation is that the foreground class typically is relatively homogeneous. By modeling the foreground class with only a single prototype, we avoid having to model the large and highly inhomogeneous background, which we believe is the main challenge in prototypical few-shot medical image segmentation. However, if our assumption of a relatively homogeneous foreground class is not met, and the foreground consists of multiple distinct regions with strong edges, e.g. combining left-ventricle blood pool and left-ventricle myocardium into one foreground class (left ventricle), modeling the foreground with one prototype might not be sufficient. This is related to the nature of the supervoxels, which tend to follow the boundaries of the structures in the image; left-ventricle blood pool and left-ventricle myocardium will typically belong to different supervoxels during training, and the network therefore learns to separate their feature representations into different clusters. To be able to capture such a combined foreground class during inference, one option could be to take inspiration from PPNet (Liu et al., 2020b) and cluster the features into multiple foreground prototypes and then merge the results.
Both the superpixel-based and the supervoxel-based self-supervision tasks are inevitably vulnerable to merging different classes during training if the boundaries between them are weak: if the boundaries are weak, the classes will end up in the same superpixel/supervoxel, and the network learns to embed the classes into the same cluster, which makes them difficult to separate during inference. Moreover, in the supervoxel case, it is enough for one slice to contain a weak boundary between the classes for them to leak into the same supervoxel. This happens between the left kidney and the spleen in the abdominal dataset, and leads to confusion between these two classes during inference, thereby resulting in lower dice scores and high standard deviations. Taking this weak/noisy nature of the supervoxel pseudo-labels into account is a promising direction for future research.

Conclusion
In this work, we proposed a novel and end-to-end trainable anomaly detection-inspired FSS network for medical image segmentation. By approaching the segmentation task as an anomaly detection problem, our model eliminates the need to explicitly model the large and heterogeneous background class. Moreover, to train the model in an unsupervised manner, we introduced a new self-supervision task that captures the 3D nature of the data by utilizing supervoxels. We assessed our proposed model on representative datasets for cardiac segmentation and abdominal organ segmentation, and showed that it improves segmentation performance and robustness, especially in the realistic scenario where no weak labels for the query images are assumed. Furthermore, we demonstrated how the proposed model, together with the new self-supervision task, has the potential to perform one-step 3D segmentation of the entire image volumes. We believe that fully exploiting the 3D nature of the medical images in this manner for few-shot segmentation represents an interesting line of research for future work.