Sauron U-Net: Simple automated redundancy elimination in medical image segmentation via filter pruning

We introduce Sauron, a filter pruning method that eliminates redundant feature maps of convolutional neural networks (CNNs). Sauron optimizes, jointly with the loss function, a regularization term that promotes feature map clustering at each convolutional layer by reducing the distance between feature maps. Sauron then eliminates the filters corresponding to the redundant feature maps by using automatically adjusted layer-specific thresholds. Unlike most filter pruning methods, Sauron requires minimal changes to typical neural network optimization because it prunes and optimizes CNNs jointly, which, in turn, accelerates the optimization over time. Moreover, unlike other cluster-based approaches, Sauron does not require the user to specify the number of clusters in advance, a hyperparameter that is difficult to tune. We evaluated Sauron and three state-of-the-art filter pruning methods on three medical image segmentation tasks, an area where little attention has been paid to filter pruning, but where smaller CNN models are desirable for local deployment, mitigating the privacy concerns associated with cloud-based solutions. Sauron was the only method that reduced model size by over 90% without substantially deteriorating performance. Sauron also yielded, overall, the fastest models at inference time on machines with and without GPUs. Finally, we show through experiments that the feature maps of models pruned with Sauron are highly interpretable, which is essential for medical image segmentation.


Introduction
Pruning is the process of eliminating unnecessary parameters to obtain compact models and accelerate their inference. There are two main strategies for pruning convolutional neural networks (CNNs): weight pruning and filter pruning. In weight pruning, the weights of unimportant connections are zeroed without considering the network structure, leading, in practice, to sparse weight matrices [1,2,3,4,5]. Filter pruning methods, on the other hand, eliminate CNN filters directly. Thus, unlike weight-pruned models, filter-pruned networks require no specialized hardware or software to be utilized efficiently [6,7]. Most pruning methods have been developed or evaluated exclusively on natural image classification. Other tasks, such as medical image segmentation, have received significantly less attention [8]. In medical imaging, small models can enable computationally-limited medical-grade computers to segment images that cannot be uploaded to a cloud server for privacy reasons. Moreover, models with few filters can be easier to interpret than large models, which is crucial not only in clinical applications but also in research. Motivated by these possibilities, we propose a filter pruning method called Sauron that generates small CNNs. We demonstrate its application to pruning U-Net-like networks [9], bringing together filter pruning and medical image segmentation.
Sauron applies filter pruning during optimization in a single phase, whereas most filter pruning frameworks consist of three distinct phases: pre-training the model, pruning its filters, and fine-tuning to compensate for the loss of accuracy (or re-training from scratch [10,11]). Other approaches combine pruning with training [12,13,14,15] or fine-tuning [16,17], resulting in two-phase frameworks, and yet other methods repeat these phases multiple times [12,16,18]. Single-phase filter pruning methods [13], such as Sauron, are advantageous since they require fewer hyperparameters and design decisions, such as the number of epochs for training and fine-tuning, the number of pruning iterations, or whether to combine pruning with training or fine-tuning. In particular, Sauron does not insert additional parameters into the optimized architecture to identify filter candidates for pruning, such as channel importance masks [11,16,19,17,20]. This avoids potentially hindering the optimization and requires less extra training time and GPU memory.
Sauron facilitates and promotes the formation of feature map clusters by optimizing a regularization term and, unlike previous cluster-based approaches [21,14,18], does not enforce the number of these clusters. Since the clusters vary depending on the training data and across layers, the optimal number of feature maps per cluster is likely to differ. Thus, determining the number of clusters is not trivial, and a poor choice may limit the accuracy and the pruning rate.
Our specific contributions are the following:
• We introduce Sauron, a single-phase filter pruning method that resembles typical CNN optimization, making it easy to use, and that adds no additional parameters to the optimized architecture.
• We show that Sauron promotes the formation of feature map clusters by optimizing a regularization term.
• We compare Sauron to other methods on three medical image segmentation tasks, where Sauron resulted in more accurate and compressed models.
• We show that the feature maps generated by a model pruned with Sauron were highly interpretable.
• We publish Sauron and the code to run all our experiments at https://github.com/jmlipman/SauronUNet.

Previous work
Filter importance. Most filter pruning approaches rely on ranking filters to eliminate the unimportant ones. The number of eliminated filters can be determined by either a fixed [22] or an adaptive threshold [15]. Filter importance can be found via particle filtering [22], or it can be computed via heuristics relying on measures such as L_p norms [23,24,15], entropy [25], or post-pruning accuracy [26]. Pruning methods can include extra terms in the loss function, such as group sparsity constraints, although these extra terms do not guarantee sparsity in CNNs [27]. Other methods aim to learn filter importance by incorporating channel importance masks into CNN architectures [11,16,19,17,20]. However, these adjustments modify the architectures to be optimized, increasing the required GPU memory during training and the optimization time, and potentially hindering the optimization. Alternatively, other methods consider the scaling factor of batch normalization layers as channel importance [27,13], but in, e.g., medical image segmentation, batch normalization is occasionally replaced by other normalization layers due to the small mini-batch size [28].
Difference minimization. These methods remove filters while trying to preserve characteristics of the original unpruned models, such as classification accuracy [10], the Taylor-expansion-approximated loss [12], or the feature maps [29,30,24,31]. A disadvantage of these methods is that they require a large GPU memory to avoid constantly loading and unloading the models in memory, which would slow down the training. Furthermore, since finding the appropriate filters to eliminate is NP-hard, certain methods resort to selecting filters based on their importance [29,24,12], or via genetic [10] or greedy [31] algorithms.
Redundancy elimination. These approaches, including Sauron, identify redundant filters by computing a similarity metric among all filters [32,33] or within clusters of filters/feature maps [14,21,18]. Previous cluster-based approaches have considered redundant those within-cluster filters near the Euclidean center [21] or median [14], or filters with similar L_1 norms over several training epochs [18]. A disadvantage of these approaches is the extra "number of clusters" hyperparameter, which is data dependent, and the same hyperparameter value might not be optimal across layers. Other methods have used Pearson's correlation between the weights [32] or between the feature maps [33] within the same layer, or the feature maps' rank [34], to indicate redundancy, although their computation is more expensive than utilizing distances as in cluster-based methods.

Sauron
In this section, we present our approach to filter pruning, which we call Simple AUtomated Redundancy eliminatiON (Sauron). Sauron optimizes, jointly with the loss function, a regularization term that leads to clusters of feature maps at each convolutional layer, accentuating the redundancy of CNNs. It then eliminates the filters corresponding to the redundant feature maps by using automatically-adjusted layer-specific thresholds. Sauron requires minimal changes to typical neural network optimization since it prunes and optimizes CNNs jointly, i.e., training involves the usual forward-backward passes and a pruning step after each epoch. Moreover, Sauron does not integrate optimizable parameters, such as channel importance masks [11,16,19,17,20], into the CNN architecture. This avoids complicating the optimization task and increasing the training time and the required GPU memory. Algorithm 1 summarizes our method.

Preliminaries
Let D = {x_i, y_i}_{i=1}^N represent the training set, where x_i denotes image i, y_i its corresponding segmentation, and N is the number of images. Let W_l ∈ R^{s_{l+1} × s_l × k × k} be the weights at layer l, composed of s_{l+1} · s_l filters of size k × k, where s_{l+1} denotes the number of output channels, s_l the number of input channels, and k the kernel size. Given feature maps O_l ∈ R^{s_l × h × w} with h × w image dimensions, the feature maps O_{l+1} ∈ R^{s_{l+1} × h × w} at the next layer are computed as

O_{l+1} = σ(Norm(W_l ∗ O_l)),   (1)

where ∗ is the convolution operation, Norm is a normalization layer, and σ is an activation function. For simplicity, we omit the bias term in Eq. (1), and we include all CNN parameters in θ = {W_1, ..., W_L}, where L is the number of layers. We denote the predicted segmentation of image x_i by ŷ_i.
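In PyTorch, one layer of Eq. (1), with the bias term omitted as in the text, can be sketched as follows; the layer sizes and the ReLU/InstanceNorm choices are illustrative assumptions, not the exact configuration used in the paper:

```python
import torch
import torch.nn as nn

# Sketch of Eq. (1): O_{l+1} = sigma(Norm(W_l * O_l)), bias omitted.
# Sizes and ReLU/InstanceNorm are illustrative assumptions.
s_l, s_l1, k, h, w = 8, 16, 3, 32, 32

conv = nn.Conv2d(s_l, s_l1, kernel_size=k, padding=k // 2, bias=False)  # W_l
norm = nn.InstanceNorm2d(s_l1)   # Norm (the paper also uses BatchNorm)
act = nn.ReLU()                  # sigma

O_l = torch.randn(1, s_l, h, w)  # feature maps at layer l (batch of 1)
O_l1 = act(norm(conv(O_l)))      # feature maps at layer l+1
print(O_l1.shape)                # torch.Size([1, 16, 32, 32])
```

With `padding=k // 2`, the spatial dimensions h × w are preserved while the channel count changes from s_l to s_{l+1}.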

Forward pass
Sauron minimizes a loss L consisting of cross entropy L_CE, Dice loss L_Dice [35], and a novel channel distance regularization term δ_opt:

L = L_CE + L_Dice + λ δ_opt, with δ_opt = (1/L) Σ_{l=1}^{L} (1/(s_{l+1} − 1)) Σ_{r≠1} ||φ(O^l_1) − φ(O^l_r)||_2,   (2)

where λ is a hyperparameter that balances the contribution of δ_opt, and φ denotes average pooling with window size and stride ω. Before computing δ_opt, the feature maps O^l_1 and O^l_{−1} (all channels except the first) are normalized to the range [0, 1] via min-max normalization, as we experimentally found this normalization strategy to be the best (see Appendix A). For pruning, Sauron computes distances between a randomly-chosen feature map π ∈ {1, ..., s_{l+1}} and all the others:

δ_prune = {d^l_r / max_r d^l_r : l = 1, ..., L; r = 1, ..., π − 1, π + 1, ..., s_{l+1}}, where d^l_r = ||φ(O^l_π) − φ(O^l_r)||_2.   (3)

Importantly, π is different in every layer and epoch, enabling Sauron to prune different feature map clusters. Moreover, since finding an appropriate pruning threshold requires the distances to lie within a known range, Sauron normalizes d^l_r such that their maximum is 1, i.e., d^l_r ← d^l_r / max_r(d^l_r).
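A minimal sketch of the channel distances underlying δ_opt and δ_prune; the helper name `channel_distances` and the exact normalization details are our assumptions based on the description above, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def channel_distances(O, ref, omega=2):
    """Distances between a reference channel of O and every other channel,
    after average pooling (phi) and per-channel min-max normalization."""
    P = F.avg_pool2d(O, kernel_size=omega, stride=omega)   # phi
    flat = P.flatten(2)                   # (batch, channels, features)
    mn = flat.min(dim=2, keepdim=True).values
    mx = flat.max(dim=2, keepdim=True).values
    flat = (flat - mn) / (mx - mn + 1e-8)                  # min-max to [0, 1]
    ref_map = flat[:, ref:ref + 1]
    others = torch.cat([flat[:, :ref], flat[:, ref + 1:]], dim=1)
    return torch.norm(others - ref_map, dim=2)             # (batch, channels - 1)

O = torch.randn(1, 16, 32, 32)                   # feature maps of one layer
d_opt_layer = channel_distances(O, ref=0).mean()  # layer term of delta_opt
pi = torch.randint(16, (1,)).item()              # random reference channel pi
d = channel_distances(O, ref=pi)
delta_prune = d / d.max()                        # normalized so the maximum is 1
```

The same distances serve both purposes: averaged against the first channel they contribute to δ_opt, and max-normalized against a random channel π they form δ_prune.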

Backward pass: δ opt regularization
Optimized CNNs have been shown to have redundant weights and to produce redundant feature maps [14,32] (Appendix E). By minimizing the extra regularization term δ_opt, CNNs further promote the formation of clusters, facilitating their subsequent pruning. δ_opt regularization pulls the feature maps near the first-channel feature map O^l_1 (i.e., within the same cluster) even closer. At the same time, feature maps dissimilar to O^l_1 (i.e., in other clusters) become more similar to other feature maps from the same cluster since, by the triangle inequality, ||φ(O^l_r) − φ(O^l_q)||_2 ≤ ||φ(O^l_1) − φ(O^l_r)||_2 + ||φ(O^l_1) − φ(O^l_q)||_2, i.e., the right-hand side, minimized via δ_opt regularization, is an upper bound of the distance between any two feature maps r and q. We demonstrate this clustering effect in Section 4.2. Furthermore, for pruning, we focus on the feature maps rather than on the weights since different, non-redundant weights can lead to similar feature maps. Thus, eliminating redundant weights does not guarantee a reduction in feature map redundancy.

Pruning step
Sauron employs layer-specific thresholds τ = [τ_1, ..., τ_L], where each τ_l is initialized to zero and increased independently (usually at a different pace) until reaching τ_max. This versatility is important, as the ideal pruning rate differs across layers due to their different purposes (i.e., extraction of low- and high-level features) and their varied numbers of filters. Additionally, this setup permits utilizing high thresholds without removing too many filters at the beginning of the optimization, when feature maps may lie close to each other due to the random initialization. In consequence, pruning is embedded into the training and remains always active, making Sauron a single-phase filter pruning method.
Procedure 1: Increasing τ_l. Pruning with adaptively increasing layer-specific thresholds raises two important questions: how and when to increase the thresholds? Sauron increases the thresholds linearly in κ steps until reaching τ_max. A threshold is updated once the model has stopped improving (C1 and C2 in Algorithm 1) and only a few filters were pruned (C3). An additional "patience" hyperparameter ensures that the thresholds are not updated in consecutive epochs (C4). Conditions C1, ..., C4 are easy to implement and interpret, and they rely on heuristics commonly employed for detecting convergence.
Procedure 2: Pruning. Sauron considers nearby feature maps to be redundant since they likely belong to the same cluster. In consequence, Sauron removes all input filters W^l_{•,s_l} whose corresponding feature map distances δ_prune are lower than the threshold τ_l. In contrast to other filter pruning methods, Sauron needs to store no additional information, such as channel indices, and the pruned models become more efficient and smaller. Additionally, since pruning occurs during training, Sauron accelerates the optimization of CNNs. After training, pruned models can be easily loaded by specifying the new post-pruning numbers of input and output filters in the convolutional layers.
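Procedures 1 and 2 can be sketched as follows; `update_threshold` and `prune_mask` are hypothetical helper names, and the distance values are made up for illustration:

```python
import torch

def update_threshold(tau_l, tau_max=0.3, kappa=15):
    """Procedure 1 (sketch): raise the layer threshold linearly in
    kappa steps until it reaches tau_max (only when C1-C4 hold)."""
    return min(tau_l + tau_max / kappa, tau_max)

def prune_mask(delta_prune, tau_l):
    """Procedure 2 (sketch): keep channels whose normalized distance to
    the reference channel is at least tau_l; closer ones are redundant."""
    return delta_prune >= tau_l

tau_l = 0.0
for _ in range(5):                     # five threshold updates
    tau_l = update_threshold(tau_l)    # tau_l is now 0.1

delta_prune = torch.tensor([0.05, 0.40, 0.02, 1.00, 0.12])
keep = prune_mask(delta_prune, tau_l)  # tensor([False, True, False, True, True])
# The corresponding input filters would then be sliced away,
# e.g. W_l = W_l[:, keep] for the next layer's weights.
```

Because the mask directly selects which weight slices survive, no channel indices need to be stored alongside the pruned model.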

Implementation
Sauron's simple design permits incorporating it easily into existing CNN optimization frameworks. In our implementation, for example, convolutional blocks are wrapped into a class that computes δ_opt and δ_prune in the forward pass, and the pruning step is a callback function triggered after each epoch. This implementation, together with the code for running our experiments and processing the datasets, was written in PyTorch [36] and is publicly available at https://github.com/jmlipman/SauronUNet. In our experiments, we utilized an Nvidia GeForce GTX 1080 Ti (11GB) and a server with eight Nvidia A100s (40GB).

Experiments
In this section, we compare Sauron with other state-of-the-art filter pruning methods and conduct an ablation study to show the impact of δ_opt regularization on pruning and performance. We empirically demonstrate that the proposed δ_opt regularization increases feature map clusterability, and we visualize the feature maps of a Sauron-pruned model.
Datasets. We employed three 3D medical image segmentation datasets: Rats, ACDC, and KiTS. Rats comprised 160 3D T2-weighted magnetic resonance images of rat brains with lesions [37], and the segmentation task was separating lesion from non-lesion voxels. We divided the Rats dataset into 0.8:0.2 train-test splits, and the training set was further divided into a 0.9:0.1 train-validation split, resulting in 115, 13, and 32 images for training, validation, and test, respectively. ACDC included the Automated Cardiac Diagnosis Challenge 2017 training set [38] (CC BY-NC-SA 4.0), comprising 200 3D magnetic resonance images of 100 individuals. The segmentation classes were background, right ventricle (RV), myocardium (M), and left ventricle (LV). We divided the ACDC dataset similarly to the Rats dataset, resulting in 144, 16, and 40 images for training, validation, and test, respectively. We utilized only ACDC's competition training set due to the limit of four submissions to the online platform of the ACDC challenge. Finally, KiTS comprised 210 3D images from the Kidney Tumor Challenge 2019 training set, segmented into background, kidney, and kidney tumor [39] (MIT). The KiTS training set was divided into a 0.9:0.1 train-validation split, resulting in 183 and 21 images for training and validation. We report the results on KiTS's competition test set (90 3D images). All 3D images were standardized to zero mean and unit variance. The train-validation-test divisions and the computation of the evaluation criteria were at the subject level, ensuring that the data from a single subject was entirely in the train set or in the test set, never divided between the two. See Appendix C for preprocessing details.
Model and optimization. Sauron and the compared filter pruning methods optimized nnUNet [28] via deep supervision [40] with Adam [41], starting with a learning rate of 10^−3, polynomial learning rate decay, and a weight decay of 10^−5. During training, images were augmented with TorchIO [42] (see Appendix C). nnUNet is a self-configurable U-Net, and the dataset-optimized nnUNet architectures differed slightly in the number of filters, encoder-decoder levels, normalization layer, batch size, and number of epochs (see Appendix C).
Pruning. Sauron reduced feature map dimensionality via average pooling with a window size and stride of ω = 2, and utilized λ = 0.5 in the loss function, a maximum pruning threshold τ_max = 0.3, κ = 15 pruning steps, and a patience of ρ = 5 (C4 in Algorithm 1). Additionally, we employed simple conditions to detect convergence for increasing the layer-specific thresholds τ. Convergence of the training loss (C1) was detected once its most recent value lay between the maximum and minimum values obtained during training. We considered that the validation loss had stopped improving (C2) once its most recent value increased with respect to all previous values. Finally, the remaining condition (C3) held true if the layer-specific threshold pruned less than 2% of the filters pruned in the previous epoch, i.e., µ = 2.

Benchmark on three segmentation tasks
We optimized and pruned nnUNet [28] with Sauron, and we compared its performance with cSGD [21], FPGM [14], and Autopruner [16] using a pruning rate similar to the one achieved by Sauron. Since cSGD and FPGM severely underperformed in this setting, we re-ran them with their pruning rate set to only 50% (r = 0.5). Additionally, to understand the influence of the proposed regularization term δ_opt on the performance and pruning rate, we conducted ablation experiments with λ = 0. We computed the Dice coefficient [44] and 95% Hausdorff distance (HD95) [45] on the Rats and ACDC test sets (see Tables 1 and 2). For the KiTS dataset, only the average Dice coefficient was provided by the online platform that evaluated the test set (see Table 3). In addition to Dice and HD95, we computed the relative decrease in the number of floating point operations (FLOPs) in all convolutions: FLOPs = HW(C_in C_out)K^2, where H and W are the height and width of the feature maps, C_in and C_out are the numbers of input and output channels, and K is the kernel size. For the 3D CNNs (KiTS dataset), extra factors D (depth) and K are multiplied in to compute the FLOPs.
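The FLOPs count from this formula can be computed directly; `conv_flops` is a hypothetical helper, and the layer sizes below are illustrative:

```python
def conv_flops(h, w, c_in, c_out, k, d=None):
    """FLOPs of one convolution as defined in the text:
    H * W * (C_in * C_out) * K^2, with extra D and K factors for 3D."""
    flops = h * w * c_in * c_out * k * k
    if d is not None:          # 3D convolutions (KiTS dataset)
        flops *= d * k
    return flops

# Pruning both input and output channels by 50% cuts FLOPs by 75%:
full = conv_flops(64, 64, 32, 64, 3)
pruned = conv_flops(64, 64, 16, 32, 3)
print(pruned / full)   # 0.25
```

This quadratic effect is why methods that reduce both s_l and s_{l+1}, as Sauron and Autopruner do, can compress FLOPs much more aggressively than output-only pruning.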
Sauron obtained the highest Dice coefficients and competitive HD95s across all datasets and segmentation classes (Tables 1 to 3). Sauron also achieved the highest reduction in FLOPs, although every method, including Sauron, can further reduce the FLOPs at the risk of worsening performance (Table 4). cSGD and FPGM could not yield models with high pruning rates, possibly because they aim at reducing only s_{l+1} and not s_l of W_l ∈ R^{s_{l+1} × s_l × k × k}. Thus, very high pruning rates cause a great imbalance between the number of input and output filters in every layer, which may hinder training. Note also that cSGD and FPGM were not tested with pruning rates higher than 60% [21,14]. In contrast, Sauron and Autopruner, which achieved working models at higher pruning rates, reduce both the input filters s_l and the output filters s_{l+1}.
Sauron without the proposed regularization term δ_opt (Sauron (λ = 0)) achieved similarly or less compressed models and worse Dice coefficients than when minimizing δ_opt. Overall, the results of these ablation experiments indicate that 1) typical CNN optimization (without δ_opt regularization) yields redundant feature maps that can be pruned with Sauron, 2) the pruning rate is generally higher with δ_opt regularization, and 3) pruning without δ_opt regularization can affect performance, possibly due to the accidental elimination of non-redundant filters. In summary, the pruning rate and performance achieved in our ablation experiments demonstrate that promoting clusterability via δ_opt regularization is advantageous for eliminating redundant feature maps.

Minimizing δ opt promotes the formation of feature maps clusters
We investigated feature map clustering tendency during nnUNet's optimization. For this, we deactivated Sauron's pruning step and optimized L on the Rats dataset with and without δ_opt, while storing the feature maps of every convolutional layer at each epoch (including epoch 0, before the optimization). Since quantifying clusterability is a hard task, we utilized three different measures: 1) We employed the dip-test [46], as Adolfsson et al. [47] demonstrated its robustness compared to other methods for quantifying clusterability. High dip-test values signal higher clusterability. 2) We computed the average number of neighbors of each feature map layer-wise. Specifically, we counted the feature maps within a radius r, where r corresponded to 20% of the distance between the first channel and the farthest channel. The radius r is recomputed every time since the initial distance between feature maps typically shrinks during training. An increase in the average number of neighbors indicates that the feature maps have become more clustered.
3) We calculated the average distance to the first feature map channel (i.e., δ opt ) for each layer, which illustrates the total reduction of those distances achieved during and after the optimization.
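Measure 2 (the average number of neighbors) can be sketched as follows, assuming feature maps flattened into vectors; `avg_neighbours` is a hypothetical helper, not the paper's implementation:

```python
import torch

def avg_neighbours(feats, frac=0.2):
    """Average number of other feature maps within radius r of each map,
    where r is frac (20%) of the distance between the first channel and
    its farthest channel. feats: (channels, features)."""
    d_first = torch.norm(feats - feats[0:1], dim=1)
    r = frac * d_first.max()
    dists = torch.cdist(feats, feats)      # pairwise Euclidean distances
    within = (dists < r).sum(dim=1) - 1    # exclude the map itself
    return within.float().mean().item()

feats = torch.randn(32, 128)   # 32 flattened feature maps of one layer
print(avg_neighbours(feats))
```

On two tight, well-separated clusters this measure approaches the cluster size minus one, which is the behavior the clusterability analysis relies on.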
In agreement with the literature [14,32], Figure 1 shows that optimizing nnUNet (without δ_opt regularization) yields clusters of feature maps. Feature maps in layer "dec block 1" (see Appendix B) show no apparent structure suitable for clustering at initialization (Fig. 1, a), and, at the end of the optimization, the feature maps appear more clustered (Fig. 1, b). Figure 1 (d, blue line) also illustrates this phenomenon: the dip-test value is low in the beginning and higher at the end of the training. However, this increasing trend did not occur in all layers. To illustrate this, we compared, for each layer, the average dip-test value, number of neighbors, and distance δ_opt in the first and last third of the training. Then, we considered the trend similar if the difference between these values was smaller than 0.001 (for the dip-test values) or smaller than 5% of the average value in the first third (for the number of neighbors and distance δ_opt). Figure 1 (e) shows that the numbers of layers in which the dip-test value increased and decreased were similar when not minimizing the δ_opt regularization term. In contrast, the number of layers with an increasing trend was proportionally larger with δ_opt regularization. Figure 1 (f) shows a similar outcome regarding the average number of neighbors, i.e., δ_opt regularization led to proportionally more neighbors near each feature map. In the same line, the average distance between the first feature map and the rest decreased more with δ_opt regularization (Fig. 1, g). Additionally, Figure 1 (c) also illustrates that incorporating the δ_opt regularization term enhances the clustering of feature maps, as there are more clusters and the feature maps are more clustered than when not minimizing δ_opt (Fig. 1, b).
We observed higher clusterability in the convolutional layers with more feature maps (see Appendix D). This is likely because such convolutional layers contribute more to the value of δ_opt (Eq. 2). On the other hand, convolutional layers with fewer feature maps have larger feature vectors (e.g., enc block 1 feature vectors are (256 × 256) × 32 in the Rats dataset), whose distances tend to be larger due to the curse of dimensionality. Sauron accounts, to some extent, for these differences with the adaptively-increasing layer-specific thresholds τ.
Another possible way to tackle these differences is by using different layer-specific λ's to increase the contribution of the distances of certain layers.We investigated the impact on feature map clusterability with higher λ values and, as illustrated in Figure 1 (h), a higher λ tended to increase the average number of neighbors, decrease δ opt , and somewhat increase the dip-test values, which, overall, signals higher clusterability.

Feature maps interpretation
Sauron produces small and efficient models that can be easier to interpret. This is due to the δ_opt regularization, which, as shown in Section 4.2, increases feature map clusterability. Each feature map cluster can be thought of as a semantic operation, and the cluster's feature maps as noisy outputs of that operation. To test this view, we inspected the feature maps from the second-to-last convolutional block (dec block 8, see Appendix B) of a Sauron-pruned nnUNet. For comparison, we included the feature maps from the same convolutional layer of the baseline (unpruned) nnUNet in Appendix E.
The first feature map depicted in Figure 2 (top) captured the background and the part of the rat head that does not contain brain tissue. The second feature map contained the rest of the rat head without the brain lesion, and the third feature map mostly extracted the brain lesion. Although the third feature map seems to suffice for segmenting the brain lesion, the first feature map might have helped the model by discarding the region with no brain tissue at all. Similarly, the first and second feature maps in Figure 2 (middle) detected the background, whereas feature maps 3, 4, and 5 extracted, with different intensities, the right cavity (red), myocardium (green), and left cavity (blue) of the heart. In Figure 2 (bottom), we can also see that each feature map captured the background, kidney (red), and tumor (blue) with different intensities. This high-level interpretation facilitates understanding the role of the last convolutional block, which, in the illustrated cases, could be replaced by simple binary operations. This shows the interpretability potential of feature map redundancy elimination methods such as Sauron.

Conclusion
We presented our single-phase filter pruning method named Sauron and evaluated it on three medical image segmentation tasks, in which Sauron yielded pruned models that were superior to the compared methods in terms of performance and pruning rate. In agreement with the literature, our experiments indicated that CNN optimization leads to redundant feature maps that can be clustered. Additionally, we introduced Sauron's δ_opt regularization that, as we showed with three different clusterability metrics, increased feature map clusterability without pre-selecting the number of clusters, unlike previous approaches. In other words, we enhanced CNNs' innate capability to yield feature map clusters via δ_opt regularization, and we exploited it for filter pruning. Finally, we showed that the few feature maps remaining after pruning nnUNet with Sauron were highly interpretable.
Limitations and potential negative impact. Sauron relies on feature maps for identifying which filters to prune. Thus, although Sauron is suitable for training models from scratch and for fine-tuning pre-trained networks, it cannot prune CNNs without access to training data, unlike [23,32,48]. Furthermore, Sauron cannot enforce a specific compression rate due to its simple distance thresholding. Although we evaluated Sauron with respect to segmentation quality, we were not able to evaluate the potential clinical impact: even a small difference in segmentation could have a large clinical impact, or, vice versa, a large difference in segmentation could be clinically meaningless. Depending on the application, these impacts could be either positive or negative.

B. nnUNet architecture

nnUNet is a self-configurable U-Net optimized with extensive data augmentation, deep supervision, and polynomial learning rate decay. In our experiments, the configuration of its architecture and optimization settings depended on the dataset, as in the original publication [28]. The architectural components that depended on the dataset were the following:

• Number of levels: Number of block pairs in the encoder with different feature map sizes. The number of levels in Fig. 4 is five. After each even block in the decoder (except in dec block 2), nnUNet computes predictions at different resolutions, enabling deep supervision (green blocks in Fig. 4).
• Number of filters: Number of filters of the first two blocks in the encoder. The number of filters in every level doubles with respect to the previous level, unless it would exceed 480, the maximum number of filters.
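This doubling-with-cap rule can be sketched as follows; `filters_per_level` is a hypothetical helper name:

```python
def filters_per_level(base, levels, cap=480):
    """Filter counts across encoder levels: doubled at each level,
    capped at 480 (the maximum number of filters)."""
    return [min(base * 2 ** i, cap) for i in range(levels)]

# ACDC: 48 initial filters and seven levels (Appendix C.3)
print(filters_per_level(48, 7))   # [48, 96, 192, 384, 480, 480, 480]
```

The cap keeps the deepest levels from dominating the parameter count.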
C. Configuration based on the dataset

The nnUNet architecture, its optimization, dataset preprocessing, and data augmentation strategy varied across datasets. This disparity in configuration aimed to tailor each model and its training settings to resemble, as much as possible, previous studies that reported state-of-the-art performance [43,38,51,52,37]. Tables 5, 6, and 7 list the configuration employed for each dataset. This configuration can also be seen in our publicly-available code.
C.1. Preprocessing

The Rats dataset was not preprocessed. The ACDC and KiTS datasets were resampled to their median voxel resolutions (Tables 6 and 7 report the final voxel resolutions in mm). In the KiTS dataset, images from patients 15, 23, 37, 68, 125, and 133 were discarded due to their faulty ground-truth segmentations. Intensity values were clipped to [−79, 304] and normalized by subtracting 101 and dividing by 76.9. Finally, images smaller than the patch size of 160 × 160 × 80 were padded.
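The KiTS intensity preprocessing above can be written in a few lines; `preprocess_kits_intensities` is a hypothetical name, and the input values are made up:

```python
import numpy as np

def preprocess_kits_intensities(volume):
    """KiTS intensity preprocessing from the text: clip to [-79, 304],
    subtract 101, and divide by 76.9."""
    clipped = np.clip(volume, -79, 304)
    return (clipped - 101.0) / 76.9

ct = np.array([-500.0, 0.0, 101.0, 1000.0])   # made-up CT intensities
print(preprocess_kits_intensities(ct))
```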

C.2. Data augmentation
During training, images from the Rats, ACDC, and KiTS datasets were augmented via TorchIO [42]. Images were randomly scaled and rotated with a certain probability p, their intensity values were altered via random gamma correction, and they were then randomly flipped and transformed via random elastic deformation. In the ACDC dataset in particular, 2D slices from the 3D volumes were cropped or padded to 320 × 320 voxels.

C.3. Architecture
The number of levels of the nnUNet models trained on the Rats and KiTS datasets was five, whereas for ACDC it was seven. nnUNet was optimized on the Rats, ACDC, and KiTS datasets with 32, 48, and 24 initial filters (enc block 1, Appendix B), respectively. The nnUNet models optimized on the Rats and ACDC datasets were 2D, whereas the model for the KiTS dataset was 3D. Finally, the normalization layer utilized for the Rats and KiTS datasets was Instance Normalization [50], whereas for ACDC it was Batch Normalization [49].
C.4. Optimization

All models were optimized with Adam [41] with a starting learning rate of 10^−3, a weight decay of 10^−5, and polynomial learning rate decay: (1 − (e/epochs))^0.9. nnUNet was optimized for 200 epochs on the Rats dataset and for 500 epochs on the ACDC and KiTS datasets. The batch size for the Rats, ACDC, and KiTS datasets was four, ten, and two, respectively.
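The polynomial decay schedule can be written directly from the formula above; `poly_lr` is a hypothetical helper name:

```python
def poly_lr(base_lr, epoch, total_epochs, power=0.9):
    """Polynomial learning-rate decay: base_lr * (1 - e/epochs)^0.9."""
    return base_lr * (1.0 - epoch / total_epochs) ** power

# Rats: 200 epochs, starting learning rate 1e-3
print(poly_lr(1e-3, 0, 200))     # 0.001
print(poly_lr(1e-3, 100, 200))   # about 5.4e-4
```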

D. Increase/decrease in clusterability metrics
Table 8 lists the relative increase/decrease in the three clusterability measures (dip-test value, distances δ opt , and average number of neighbors) for each convolutional layer.The increase/decrease is computed as the ratio between p 1 and p 2 , where p 1 is the average value during the first third of the optimization, and p 2 is the average value in the last third of the training.An increase in clusterability is indicated by 1) an increase in dip-test, 2) a decrease in δ opt , and 3) an increase in the average number of neighbors.

Figure 1 :
Figure 1: a-c) tSNE plots of "dec block 1" feature maps at initialization (epoch 0), and after optimizing with and without δ_opt. d) Corresponding dip-test values during the optimization. e-g) Summary of the trends across the three clusterability measures in all convolutional layers. h) Number of layers with an increasing trend in the three clusterability measures with higher values of λ (dashed line: Sauron's default configuration).

Figure 2 :
Figure 4 :
Figure 2: Image slice from Rats (top), ACDC (middle), and KiTS (bottom) datasets, its ground-truth segmentation, and all feature maps at the second-to-last convolutional block after pruning with Sauron.

Table 1 :
Performance on Rats dataset.

Table 2 :
Performance on ACDC dataset. Bold: best performance among pruning methods.

Table 3 :
Performance on KiTS dataset.

Table 4 :
Decrease in FLOPs with respect to the baseline nnUNet. Bold: highest decrease.

Table 8 :
Name of the convolutional layer (see Section B), number of output filters, and relative increase/decrease in three clusterability measures. Gray: layers with 256 or more feature maps. Columns: Conv. layer, Filters, Dip-test, Distances δ_opt, Avg. neighbors.