Constrained unsupervised anomaly segmentation

Current unsupervised anomaly localization approaches rely on generative models to learn the distribution of normal images, which is later used to identify potential anomalous regions derived from errors on the reconstructed images. However, a main limitation of nearly all prior literature is the need to employ anomalous images to set a class-specific threshold to locate the anomalies. This limits their usability in realistic scenarios, where only normal data is typically accessible. Despite this major drawback, only a handful of works have addressed this limitation, by integrating supervision on attention maps during training. In this work, we propose a novel formulation that does not require accessing images with abnormalities to define the threshold. Furthermore, and in contrast to very recent work, the proposed constraint is formulated in a more principled manner, leveraging well-known knowledge in constrained optimization. In particular, the equality constraint on the attention maps in prior work is replaced by an inequality constraint, which allows more flexibility. In addition, to address the limitations of penalty-based functions, we employ an extension of the popular log-barrier methods to handle the constraint. Last, we propose an alternative regularization term that maximizes the Shannon entropy of the attention maps, reducing the number of hyperparameters of the proposed model. Comprehensive experiments on two publicly available datasets on brain lesion segmentation demonstrate that the proposed approach substantially outperforms the relevant literature, establishing new state-of-the-art results for unsupervised lesion segmentation without the need to access anomalous images.


Introduction
Deep learning models are driving progress in a wide range of visual recognition tasks, particularly when they are trained with large amounts of annotated samples. This learning paradigm, however, carries two important limitations. First, obtaining such curated labeled datasets is a cumbersome process prone to annotator subjectivity, limiting the access to sufficient training data in practice. This problem is further magnified in the context of medical image segmentation, where labeling involves assigning a category to each image pixel or voxel. In addition, even if annotated images are available, there exist some applications, such as brain lesion detection, where large intraclass variations are not captured during training, causing models to fail on unseen abnormality patterns.

A popular strategy to tackle unsupervised anomaly segmentation is to model the distribution of normal images in the training set. To this end, generative models, such as generative adversarial networks (GANs) (Schlegl et al. (2017, 2019); Andermatt et al. (2019); Ravanbakhsh et al. (2019); Baur et al. (2020); Sun et al. (2020)) and variational auto-encoders (VAEs) (Chen and Konukoglu (2018); Nick Pawlowski (2018); Sabokrou et al. (2019); Chen et al. (2020); Zimmerer et al. (2020)), have been widely employed. In particular, these models are trained to reconstruct their input images, which are drawn from a normal, i.e., healthy, distribution. At inference, input images are compared to their reconstructed normal counterparts, which are recovered from the learned distribution. Then, the anomalous regions are identified from the reconstruction error.
As an alternative to these methods, a few recent works have integrated class-activation maps (CAMs) during training (Venkataramanan et al. (2020); Liu et al. (2020)). In particular, Venkataramanan et al. (2020) leverage the generated attention maps as an additional supervision cue, enforcing the network to provide attentive regions covering the whole context in normal images. This term was formulated as an equality constraint in the form of an L1 penalty over each individual pixel. Nevertheless, we found that explicitly forcing the network to produce maximum attention values at each pixel does not achieve satisfactory results in the context of brain lesion segmentation. In addition, recent literature in constrained optimization for deep neural networks suggests that simple penalties -such as the function used in Venkataramanan et al. (2020)- might not be the optimal solution to constrain the output of a CNN (Kervadec et al. (2019c)).
Based on these observations, we propose a novel formulation for unsupervised semantic segmentation of brain lesions in medical images. The key contributions of our work can be summarized as follows:
• A novel constrained formulation for unsupervised lesion segmentation, which integrates an auxiliary constrained loss to force the network to generate attention maps that cover the whole context in normal images.
• In particular, we leverage global inequality constraints on the generated attention maps to force them to be activated around a certain target value. This contrasts with the previous work in Venkataramanan et al. (2020), where local pixel-wise equality constraints on Grad-CAMs (Selvaraju et al. (2020)) are employed. In addition, to address the limitations of penalty-based functions, we resort to an extended version of the standard log-barrier.
• Furthermore, we consider an alternative regularization term that maximizes the Shannon entropy of the attention maps, reducing the number of hyperparameters with respect to the extended log-barrier model, while yielding on-par performance.
• We benchmark the proposed model against a relevant body of literature on two public lesion segmentation benchmarks: BraTS and Physionet-ICH datasets. Comprehensive experiments demonstrate the superior performance of our model, establishing a new state-of-the-art for this task.
This journal version provides a substantial extension of the conference work presented in (Silva-Rodríguez et al., 2021). First, we extended the literature survey, particularly for unsupervised medical image segmentation. Then, in terms of methodology, the current version introduces several important modifications. In particular, we further investigate the role of the gradients on the attention maps derived from Grad-CAM in the task of unsupervised anomaly detection. Based on our empirical observations, we modify the formulation in Silva-Rodríguez et al. (2021) to directly constrain the activation maps without involving any gradient information. Furthermore, we propose an alternative learning objective for our constrained problem based on the Shannon entropy. More concretely, we replace our log-barrier formulation with an entropy-maximization term on the softmax activation of brain tissue pixels, which reduces the complexity in terms of hyperparameters with respect to the former model. Last, we add comprehensive experiments to empirically validate our method, including an additional dataset and extensive ablation studies on several design choices.

Unsupervised anomaly segmentation
Unsupervised anomaly segmentation aims at identifying abnormal pixels on test images, containing, for example, lesions on medical images (Baur et al. (2020); Chen and Konukoglu (2018)), defects in industrial images (Bergmann et al. (2019); Liu et al. (2020); Venkataramanan et al. (2020)) or abnormal events in videos (Abati et al. (2019); Ravanbakhsh et al. (2019)). A main body of the literature has explored unsupervised deep (generative) representation learning to learn the distribution of normal data. The underlying assumption is that a model trained on normal data will not be able to reconstruct anomalous regions, and the reconstruction difference can therefore be used as an anomaly score. Under this learning paradigm, generative adversarial networks (GANs) (Goodfellow et al. (2014)) and variational autoencoders (VAEs) (Kingma and Welling (2014)) are typically employed. Nevertheless, even though GANs and VAEs both model a latent variable, the manner in which they approximate the distribution of a set of samples differs. GAN-based approaches (Schlegl et al. (2017, 2019); Andermatt et al. (2019); Ravanbakhsh et al. (2019); Baur et al. (2020); Sun et al. (2020)) approximate the distribution by optimizing a generator to map random samples from a prior distribution in the latent space into data points that a trained discriminator cannot distinguish. On the other hand, the data distribution is approximated in VAEs by using variational inference, where an encoder approximates the posterior distribution in the latent space and a decoder models the likelihood (Sabokrou et al. (2019); Dehaene et al. (2020)). Recent literature on unsupervised anomaly segmentation also includes approaches based on neither VAEs nor GANs. For instance, Bergmann et al. (2020) exploit the teacher-student learning paradigm, highlighting anomalies on those outputs where the student networks and teacher model predictions differ.
Additionally, feature-based methods (Shi et al., 2021; Bergmann et al., 2020), which identify anomalies in the feature space, can also be employed.

Unsupervised anomaly segmentation in medical imaging
In the context of medical images, most current literature resorts to VAEs, proposing several improvements to overcome specific limitations of simple VAEs (Chen and Konukoglu, 2018; Nick Pawlowski, 2018; Chen et al., 2020; Zimmerer et al., 2019). For example, to handle the lack of consistency in the learned latent representation in prior works, Chen and Konukoglu (2018) included a constraint that helps mapping an image containing abnormal anatomy close to its corresponding healthy image in the latent space. Zimmerer et al. (2019) presented a context-encoding VAE that combines reconstruction-based with density-based anomaly scoring to capture the high-level structure present in the data. More recently, a probabilistic model that uses a network-based prior as the normative distribution on the latent-variable model was proposed in (Chen et al., 2020). In particular, this model penalized large deviations between the reconstructed and original input images, reducing false positives in pixel-wise predictions. Generative models have also been employed to tackle the unsupervised lesion segmentation task (Baur et al., 2020; Nguyen et al., 2021). While SteGANomaly (Baur et al., 2020) integrated a CycleGAN-based style-transfer framework to map samples in the latent space much closer to the training distribution, Nguyen et al. (2021) mask out random regions of the input data before they are fed to the GAN model. Note that a detailed survey on unsupervised anomaly localization in medical imaging can be found in Baur et al. (2021). However, despite the recent popularity of these methods, the results from the Medical Out-of-Distribution Analysis Challenge 2020 (Zimmerer et al. (2022)) highlight their suboptimal performance on anomaly segmentation, which might impede their usability in clinical practice, as stressed by Meissen et al. (2022).
More recently, Venkataramanan et al. (2020) integrate attention maps derived from Grad-CAM (Selvaraju et al. (2020)) during training as supervisory signals. In particular, in addition to standard learning objectives, the authors introduce an auxiliary loss that tries to maximize the attention maps on normal images by including an equality constraint in the form of an L1 penalty over each individual pixel.

Constrained segmentation
Imposing global constraints on the output predictions of deep CNNs has gained attention recently, particularly in weakly supervised segmentation. These constraints can be embedded into the network outputs in the form of direct loss functions, which guide the network training when fully labeled images are not accessible. For example, a popular scenario is to enforce the softmax predictions to satisfy a prior knowledge on the size of the target region. Jia et al. (2017) employed an L2 penalty to impose equality constraints on the size of the target regions in the context of histopathology image segmentation. In Zhang et al. (2017), the authors leverage the target properties by enforcing the label distribution of predicted images to match an inferred label distribution of a given image, which is achieved with a KL-divergence term. Similarly, Zhou et al. (2019) proposed a novel loss objective in the context of partially labeled images, which integrated an auxiliary term, based on a KL-divergence, to enforce that the average output size distributions of different organs approximate their empirical distributions, obtained from fully-labeled images.
While the equality-constrained formulations proposed in these works are very interesting, they assume exact knowledge of the target-size prior. In contrast, inequality constraints can relax this assumption, allowing much more flexibility. In Pathak et al. (2015), the authors imposed inequality constraints on a latent distribution (which represents a "fake" ground truth) instead of the network output, to avoid the computational complexity of directly using Lagrangian-dual optimization. Then, the network parameters are optimized to minimize the KL divergence between the network softmax probabilities and the latent distribution. Nevertheless, their formulation is limited to linear constraints. More recently, inequality constraints have been tackled by augmenting the learning objective with a penalty-based function, e.g., an L2 penalty, which can be imposed within a continuous optimization framework (Kervadec et al. (2019c,a); Bateson et al. (2021)) or in the discrete domain. Although these methods have demonstrated remarkable performance in weakly supervised segmentation, they require that prior knowledge, exact or approximate, is given. This contrasts with the proposed approach, which is trained on data without anomalies, and hence the size of the target is zero.

Methodology
An overview of our method is presented in Fig. 1.
In what follows, we describe each component of our methodology.
Preliminaries. Let us denote the set of unlabeled training images as $\mathcal{D} = \{x_n\}_{n=1}^{N}$, where $x_n \in \mathcal{X} \subset \mathbb{R}^{\Omega_i}$ represents the $n$-th image and $\Omega_i$ denotes the spatial image domain. This dataset contains only normal images, e.g., healthy images in the medical context, and therefore has no segmentation mask associated with each image. We now define an encoder, $f_\theta(\cdot): \mathcal{X} \rightarrow \mathcal{Z}$, parameterized by $\theta$, which is optimized to project normal data points in $\mathcal{D}$ into a manifold of lower dimensionality $d$, $z \in \mathcal{Z} \subset \mathbb{R}^d$. Furthermore, a decoder $f_\phi(\cdot): \mathcal{Z} \rightarrow \mathcal{X}$, parameterized by $\phi$, aims at reconstructing an input image $x \in \mathcal{X}$ from $z \in \mathcal{Z}$, which results in $\hat{x} = f_\phi(f_\theta(x))$.

Vanilla VAE
A Variational Autoencoder (VAE) is an encoder-decoder style generative model, which is currently the dominant strategy for unsupervised anomaly localization. Training a VAE consists in minimizing a two-term loss function, which is equivalent to maximizing the evidence lower-bound (ELBO) (Kingma and Welling (2014)):

$$\mathcal{L}_{VAE}(\theta, \phi) = \mathcal{L}_R(x, \hat{x}) + \beta \, D_{KL}\!\left(q_\theta(z|x) \,\|\, p(z)\right) \quad (1)$$

where $\mathcal{L}_R$ is the reconstruction error term between the input and its reconstructed counterpart. The right-hand term is the Kullback-Leibler (KL) divergence (weighted by $\beta$) between the approximate posterior $q_\theta(z|x)$ and the prior $p(z)$, which acts as a regularizer, penalizing approximations of $q_\theta(z|x)$ that differ from the prior.
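As a concrete illustration, the two-term objective above can be sketched in NumPy as follows. This is a minimal sketch assuming a Gaussian posterior with parameters (mu, log_var), a standard-normal prior, and binary cross-entropy as the reconstruction term (as adopted later in the implementation details); the function name is ours.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Two-term VAE objective: reconstruction error plus beta-weighted KL.

    Assumes a Gaussian posterior q(z|x) = N(mu, diag(exp(log_var))) and a
    standard-normal prior p(z), for which the KL term has a closed form.
    """
    eps = 1e-7
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    # Binary cross-entropy reconstruction term, summed over pixels.
    l_r = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    # Closed-form KL(q(z|x) || N(0, I)).
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return l_r + beta * kl
```

In practice the two terms are computed by the framework's autograd machinery; the sketch only makes the loss structure explicit.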

Size regularizer via VAE attention
Very recent literature (Liu et al. (2020); Venkataramanan et al. (2020)) has explored the use of attention maps for anomaly localization. In particular, attention maps $a \in \mathbb{R}^{\Omega_i}$ are generated from the latent mean vector $z_\mu$, by using Grad-CAM (Selvaraju et al. (2020)) via backpropagation to an encoder block output $f_\theta^s(x)$, at a given network depth $s$. Thus, for a given input image $x_n$, its corresponding attention map is computed as follows:

$$a_n = \sigma\!\left(\sum_k^K \alpha_k f_\theta^s(x_n)_k\right) \quad (2)$$

where $K$ is the total number of filters of that encoder layer, $\sigma$ a sigmoid operation, and $\alpha_k$ are the generated gradients such that:

$$\alpha_k = \frac{1}{|\Omega_T|} \sum_{l \in \Omega_T} \frac{\partial z_\mu}{\partial f_\theta^s(x)_{k,l}}$$

where $\Omega_T$ is the spatial features domain.
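The computation of Eq. 2 can be sketched as below, assuming the encoder features and the gradients of z_mu with respect to them have already been obtained (e.g., via autograd in a deep learning framework); the function names are ours.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_cam_attention(features, grads):
    """Grad-CAM-style attention map (sketch of Eq. 2).

    features: (K, H, W) activations of the chosen encoder block.
    grads:    (K, H, W) gradients of the latent mean z_mu w.r.t. features.
    The channel weights alpha_k are the spatially averaged gradients.
    """
    alpha = grads.mean(axis=(1, 2))                   # (K,) channel weights
    cam = np.tensordot(alpha, features, axes=(0, 0))  # (H, W) weighted sum
    return sigmoid(cam)
```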
In Venkataramanan et al. (2020), the authors leveraged the Grad-CAM-based attention maps (Eq. 2) by enforcing them to cover the whole normal image. To achieve this, their loss function was augmented with an additional term, referred to as expansion loss, which takes the form of: $\mathcal{L}_s = \frac{1}{|a|} \sum_{l \in \Omega_i} (1 - a_{n,l})$. We can easily observe that this term resembles multiple equality constraints, one at each pixel, forcing the class activation maps to be maximum over the whole image in a pixel-wise manner (i.e., it penalizes each single pixel individually). Contrary to this work, we integrate supervision on attention maps by enforcing inequality constraints on their global target size.
Note that the use of inequality constraints is motivated by the choice of the barrier function in the constrained problem, which is further detailed in Section 3.3. Hence, we aim at minimizing the following constrained optimization problem:

$$\min_{\theta, \phi} \; \mathcal{L}_{VAE}(\theta, \phi) \quad \text{s.t.} \quad f_c(a_n) \leq 0, \;\; \forall n \quad (3)$$

where $f_c(a_n) = 1 - \frac{1}{|\Omega_i|} \sum_{l \in \Omega_i} a_{n,l}$ is the constraint over the attention map from the $n$-th image, which enforces the generated attention map to cover the whole image. It is well-known in optimization that a penalty does not act as a barrier near the boundary of the feasible set (Boyd et al., 2004). In other words, a constraint that is satisfied results in a null penalty and gradient. Therefore, at a given gradient update, there is nothing that prevents a satisfied constraint from being violated, causing oscillations between competing constraints and ultimately resulting in potentially unstable training. This is further exacerbated in the case of many constraints (as in Venkataramanan et al. (2020)), motivating the use of a single global constraint to achieve maximum coverage of the class-activation maps over the whole image in our scenario. From Eq. 3 we can derive an approximate unconstrained optimization problem by employing a penalty-based method, which takes the hard constraint and moves it into the loss function as a penalty term $P(\cdot)$:

$$\min_{\theta, \phi} \; \mathcal{L}_{VAE}(\theta, \phi) + \lambda P(f_c(a_n))$$

Thus, each time the constraint $f_c(a_n) \leq 0$ is violated, the penalty term $P(f_c(a_n))$ increases.
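To make the difference concrete, the pixel-wise expansion term, the global size constraint, and an L2 penalty over it can be sketched as follows (a is an attention map with values in [0, 1]; the function names are ours):

```python
import numpy as np

def expansion_loss(a):
    """Pixel-wise expansion term: one equality constraint per pixel,
    pushing every attention value individually towards 1."""
    return np.mean(1.0 - a)

def f_c(a):
    """Global inequality constraint f_c(a) <= 0: the mean attention over
    the whole image should reach full coverage (target value 1)."""
    return 1.0 - np.mean(a)

def l2_penalty(z):
    """Penalty P(z): non-zero only when the constraint is violated (z > 0),
    hence null gradient everywhere inside the feasible set."""
    return np.maximum(z, 0.0) ** 2
```

The null gradient of `l2_penalty` for satisfied constraints is precisely the behavior that motivates the barrier-based alternative discussed next.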

Extended log-barrier as an alternative to penalty-based functions
Despite having demonstrated good performance in several applications (Kervadec et al. (2019c); Jia et al. (2017)), penalty-based methods have several drawbacks. First, these unconstrained minimization problems have an increasingly unfavorable structure due to ill-conditioning (Fiacco and McCormick (1990); Luenberger (1973)), which typically results in exceedingly slow convergence. Second, finding the optimal penalty weight is not trivial. In addition, we advocate for the use of the log-barrier extension over penalties because its strictly positive gradient becomes larger when a satisfied constraint approaches violation during optimization, pushing it back towards the feasible set (see Figure 1 in Kervadec et al. (2019c)). As explained in the previous section, this contrasts with penalties, which deliver null gradients whenever a given constraint is satisfied. To address these limitations, we replace the penalty-based functions with the approximation of the log-barrier presented in Kervadec et al. (2019c). We would like to stress that barrier methods require the interior of the feasible set to be non-empty; they are therefore used in constrained optimization problems with inequality constraints, such as the one defined in Eq. 3 (note that there is no interior for equality constraints). Thus, we can formally define the approximation of the log-barrier as:

$$\tilde{\psi}_t(z) = \begin{cases} -\frac{1}{t} \log(-z) & \text{if } z \leq -\frac{1}{t^2} \\ tz - \frac{1}{t} \log\!\left(\frac{1}{t^2}\right) + \frac{1}{t} & \text{otherwise} \end{cases} \quad (4)$$

where $t$ controls the barrier during training, and $z$ is the constraint $f_c(a_n)$. Thus, by taking into account the approximation in Eq. 4, we can solve the following unconstrained problem using standard gradient descent:

$$\min_{\theta, \phi} \; \mathcal{L}_{VAE}(\theta, \phi) + \lambda_s \tilde{\psi}_t(f_c(a_n)) \quad (5)$$

In this scenario, for a given $t$, the optimizer will try to find a solution with a good compromise between minimizing the VAE loss and satisfying the constraint $f_c(a_n)$. In the following, we refer to this formulation of the Grad-CAM constraint as the GradCAMCons setting.
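The log-barrier extension of Kervadec et al. (2019c) can be written as the following sketch. Unlike a penalty, which is flat on the feasible set, it is strictly increasing and differentiable everywhere, so a satisfied constraint approaching violation still receives a restoring gradient:

```python
import numpy as np

def log_barrier_ext(z, t=10.0):
    """Extended log-barrier (Kervadec et al., 2019c) for a constraint z <= 0.

    For z <= -1/t**2 it follows the scaled log-barrier; beyond that point it
    switches to a linear extension that keeps the function continuous and
    differentiable at the switching point z = -1/t**2.
    """
    if z <= -1.0 / t**2:
        return -(1.0 / t) * np.log(-z)
    return t * z - (1.0 / t) * np.log(1.0 / t**2) + 1.0 / t
```

As t grows during training, the function approaches a hard barrier, progressively tightening the constraint.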

On the role of gradients in VAEs
Even though there exist a few initial attempts to integrate attention maps in the task of unsupervised anomaly detection, how gradient-based attention behaves on anomalous patterns remains unclear. For instance, Liu et al. (2020) argue that anomalies produce larger gradients in the learned latent representation, which results in more highly activated attention maps. On the other hand, Venkataramanan et al. (2020) state that the VAE only focuses on normal patterns (with which it has been trained), and thus anomalous regions produce gradients of smaller absolute value. These inconsistencies in the literature have motivated us to analyze the underlying role of the gradients in the context of brain image analysis. Thus, we performed several experiments to analyze the behaviour of Grad-CAMs in anomaly localization compared to non-weighted activation maps (AMs), which are computed as:

$$a_n = \sum_k^K f_\theta^s(x_n)_k \quad (6)$$

In particular, we could not find any benefit of gradient weighting other than serving as a scaling factor for attention maps to fall in the non-saturated range of typically used activation functions, such as the sigmoid operation in Eq. 2 (see Figure 2, where we show that the values obtained by both types of attention are highly correlated). Furthermore, we found that the reconstructed images derived from the gradient-based attention contained more errors compared to those reconstructed with attention on the activation maps (Eq. 6). We refer the reader to Section 1 of the Supplemental Material for the detailed results concerning the role of the gradients.

Entropy maximization as a proxy for the constraint
Based on our previous findings, we advocate that non-weighted activation maps (AMs) should be preferred over their gradient-based counterpart. Nevertheless, this solution has a main limitation that hinders the use of size constraints. As the activation maps are not normalized, the arbitrary activation value on which the constraint is imposed loses its meaning as a size or proportion. The activation values produced by neural networks vary across applications, as well as with the architecture used, which makes it difficult to establish generalizable restrictions on their value. For this reason, we propose to use attention maps derived from normalizing the activation maps over all the pixels of the image, via a softmax activation, similarly to Ilse et al. (2018), such that $p_n = \mathrm{softmax}_{\Omega_i}(a_n)$. Since these attention maps are normalized across pixels and not over classes, the use of global size constraints is meaningless, as the sum over all the pixels post-softmax will be equal to 1.0. Nevertheless, we still aim at regularizing the attention distribution $p_n$ to focus on all patterns in the image homogeneously. To this end, we propose to minimize the KL divergence $D_{KL}(p\|q) = H(p, q) - H(p)$ between the attention distribution $p$ and a constant distribution $q$, where $H(p, q)$ represents the cross-entropy between both distributions, and $H(p) = H(p, p)$ is the Shannon entropy of the attention distribution, such that $H(p) = -\sum_l p_l \cdot \log(p_l)$. In the scenario where we want $p$ to match a constant distribution, the cross-entropy $H(p, q)$ is constant, and it is therefore straightforward to see that minimizing the KL divergence is equivalent to maximizing the entropy $H(p)$:

$$D_{KL}(p\|q) = H(p, q) - H(p) \stackrel{c}{=} -H(p) \quad (7)$$

where $\stackrel{c}{=}$ indicates equality up to an additive constant. Thus, the proposed constrained optimization problem integrating an entropy maximization term, referred to as $\mathcal{L}_H$, offers a softer attention constraint compared to the solution in Eq. 5.
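This equivalence can be checked numerically: for a constant q over N pixels, D_KL(p||q) = log N - H(p), so the two objectives differ only by the constant log N. A small sketch (function names are ours; unnormalized entropy is used here):

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = -sum p log p over pixels (zero entries contribute nothing)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_to_uniform(p):
    """D_KL(p || q) for a constant (uniform) q over the N pixels of p."""
    n = p.size
    q = np.full(n, 1.0 / n)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
```

Maximizing H(p) therefore drives the pixel-wise softmax attention towards the uniform distribution, i.e., homogeneous coverage of the normal image.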
Furthermore, this formulation allows the VAE to keep the most suitable activation values, while requiring fewer hyperparameters to be optimized. Analogously to Eq. 5, we solve the constrained optimization problem with $\mathcal{L}_H$ using standard gradient descent:

$$\min_{\theta, \phi} \; \mathcal{L}_{VAE}(\theta, \phi) - \lambda_H H(p_n) \quad (8)$$

Hereafter, we will refer to this formulation as AMCons.
Inference

During inference, we use the generated attention as an anomaly saliency map. For the Grad-CAM-based settings, we replaced the sigmoid operation by a minimum-maximum normalization in order to avoid saturation caused by large activations. During the experimental stage, we found that anomalies produce larger activations on attention maps than the constrained normal samples, in line with prior literature (Liu et al. (2020)). Then, the map is thresholded to create an anomaly mask of the image.
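A minimal sketch of this inference step (min-max normalization followed by thresholding; in our experiments the threshold is the operating point derived from the validation precision-recall curve, and the function name is ours):

```python
import numpy as np

def anomaly_mask(attention, threshold):
    """Min-max normalize the attention map to [0, 1], then threshold it to
    obtain the binary anomaly mask."""
    rng = attention.max() - attention.min()
    a = (attention - attention.min()) / (rng + 1e-8)
    return a >= threshold
```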

Datasets
The experiments described in this work are carried out in the context of brain lesion localization. Concretely, we address two relevant neuroimaging challenges: tumour segmentation in MRI volumes and intracranial hemorrhage (ICH) segmentation in CT scans.
Brain tumor segmentation. For this task, we used the popular BraTS 2019 dataset (Menze et al. (2015); Bakas et al. (2017, 2018)), which contains 335 multi-institutional multi-modal MR scans with their corresponding glioma segmentation masks. Following Baur et al. (2019), from every patient, 10 consecutive axial slices of the FLAIR modality with a resolution of 224 × 224 pixels were extracted around the center to get a pseudo MRI volume. Then, the dataset is split into training, validation and testing groups, with 271, 32 and 32 patients, respectively. Following standard practices in the literature, during training only the slices without lesions are used as normal samples, and for validation and testing, scans with less than 0.01% of tumour are discarded.
Intracranial hemorrhage segmentation. We use the Physionet-ICH dataset (Hssayeni (2020); Hssayeni et al. (2020a); Goldberger et al. (2000)) to localize intracranial hemorrhage lesions. The dataset is composed of 82 non-contrast CT scans of subjects with traumatic brain injury. From those, 36 cases are diagnosed with intracranial hemorrhage of different types: intraventricular, intraparenchymal, subarachnoid, epidural and subdural. ICH lesions were slice-wise delineated by two expert radiologists. In our work, we join the different ICH types into one single label for binary lesion segmentation. CT scans are skull-stripped, intensity-normalized, and co-registered into a reference scan. Similar to the BraTS dataset, 10 consecutive axial slices of resolution 224 × 224 pixels around the center were extracted to get CT pseudo volumes. The dataset is divided into training, validation and testing splits. The first one contains only non-ICH cases (n=46), while cases with labeled lesions were used for validation (n=6) and testing (n=30). Although the core ablation experiments in this work are described on the BraTS dataset, we use the Physionet-ICH dataset to demonstrate the generalization capabilities of our proposed method on different brain lesions and imaging modalities.

Evaluation Metrics
We resort to standard metrics for unsupervised brain lesion segmentation, as in Baur et al. (2021). Concretely, we compute the dataset-level area under the precision-recall curve (AUPRC) at the pixel level, as well as the area under the receiver-operating characteristic curve (AUROC). From the former, we obtain the operating point (OP) used as the threshold to generate the final segmentation masks. Then, we compute the best dataset-level Sørensen-Dice score (DICE) and intersection-over-union (IoU) over these segmentation masks. Finally, we compute the average Sørensen-Dice score (DICE) over single scans. For each experiment, the reported metrics are the average of three consecutive repetitions of the training, to account for the variability of the stochastic factors involved in the process.
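For reference, the overlap metrics on binary masks can be sketched as follows (a small smoothing term avoids division by zero on empty masks; function names are ours):

```python
import numpy as np

def dice_score(pred, gt):
    """Soerensen-Dice coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def iou_score(pred, gt):
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + 1e-8)
```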

Implementation Details
The VAE architecture used in this work is based on the recently proposed framework in Venkataramanan et al. (2020). Concretely, the convolution layers of ResNet-18 (He et al. (2016)) are used as the encoder, followed by a dense latent space $z \in \mathbb{R}^{32}$. For image generation, a residual decoder is used, which is symmetrical to the encoder. It is noteworthy that, even though several methods have resorted to a spatial latent space (Baur et al. (2019); Venkataramanan et al. (2020)), we observed that a dense latent space provided better results, which aligns with the recent benchmark in Baur et al. (2021). To train the GradCAMCons formulation in Eq. 5, we first trained the VAE during 50 epochs without any expansion term, to stabilize convergence, using β = 1. Then, the proposed regularizer was integrated (Eq. 5), with t = 10 and λ_s = 10^3, applied to the Grad-CAMs obtained from the first convolutional block of the encoder during 250 epochs. We use a batch size of 8 images and a learning rate of 1e−5 with ADAM as the optimizer. The reconstruction loss, L_R, in Eq. (1) is the binary cross-entropy. Similarly, the AMCons formulation in Eq. 8 was trained using β = 10 and λ_H = 0.1, with a learning rate of 1e−4. Ablation experiments motivating the chosen values are presented in Section 5.2 and Section 3 of the supplemental materials. The code and trained models are publicly available at https://github.com/jusiro/constrained_anomaly_segmentation/.

Baselines
In order to compare our approach to state-of-the-art methods, we implemented prior works and validated them on the datasets used, under the same conditions. First, we use residual-based methods to match the recent benchmark on unsupervised lesion localization in Baur et al. (2021). Then, we implement up-to-date methods based on contrast adjustment of the input image via histogram equalization. We also include recently proposed methods that integrate CAMs to locate anomalies. For all strategies, the AE/VAE architecture was the same as the one used in the proposed method.

Residual methods: given an anomalous sample, these methods aim to use the AE/VAE to reconstruct its normal counterpart. Then, they obtain an anomaly localization map from the residual between both images, such that $m = |x - \hat{x}|$, where $|\cdot|$ indicates the absolute value. In the AE/VAE scenario, we include methods which propose modifications over the vanilla versions, including context data augmentation in Context AE (Zimmerer et al. (2019)), Bayesian AEs (Nick Pawlowski (2018)), Restoration VAEs (Chen et al. (2020)), an adversarial-based VAE, AnoVAEGAN (Baur et al. (2019)), and a recent GAN-based approach, F-anoGAN (Schlegl et al. (2019)). For methods including adversarial learning, DCGAN (Radford et al. (2016)) is used as the discriminator. During inference, residual maps are masked using a slightly eroded brain mask, to avoid noisy reconstructions along the brain borderline.

Equalization-based methods: very recent methods have highlighted the limits of residual-based approaches to properly discern brain lesions (Meissen et al. (2021, 2022)). In contrast, they propose to apply an equalization of the histogram of the input image, and to set a threshold on the preprocessed image, considering that brain lesions often show hyperintense patterns in different modalities. Concretely, we include the method proposed in Meissen et al. (2021), which we refer to as HistEq.

CAMs-based methods: we use Grad-CAM VAE (Liu et al. (2020)), which obtains regular Grad-CAMs on the encoder from the latent space $z_\mu$ of a trained vanilla VAE. Concretely, we include a disentanglement variant of CAMs proposed in this work, which computes the combination of individually-calculated CAMs from each dimension in $z_\mu$, referred to as Grad-CAM$_D$ VAE. We also use the recent method in Venkataramanan et al. (2020) (CAVGA), which applies an L1 penalty on the generated CAM to maximize the attention. In contrast to our model and Liu et al. (2020), the anomaly mask in Venkataramanan et al. (2020) is generated by focusing on the regions not activated in the saliency map, such that $a = 1 - \mathrm{CAM}$, hypothesizing that the network has learnt to focus only on normal regions. Then, $a$ is thresholded at 0.5 to obtain the final anomaly mask $m \in \mathbb{R}^{\Omega_i}$. For both CAM-based methods, the network layer used to obtain the Grad-CAMs is the same as in our method.

Comparison to the literature.
The quantitative results obtained by the proposed model and baselines on the test cohort are presented in Table 1. Results from residual-based baselines range between [0.056-0.511] (AUPRC) and [0.188-0.525] (DICE), which is in line with previous literature (Baur et al. (2021)). We can observe that the proposed formulations outperform these approaches by a large margin. Concretely, the AMCons method provides a substantial increase of ∼34% and ∼26% in terms of AUPRC and DICE, respectively, compared to the best residual-based model, i.e., F-anoGAN. Furthermore, the model integrating the L_H term significantly outperforms our previous method in Silva-Rodríguez et al. (2021). This supports our hypothesis that using non-weighted attention maps with an entropy-maximization term as constraint is indeed a better solution for the unsupervised lesion segmentation task. Finally, in comparison with the very recently proposed histogram-equalization method, HistEq, our formulation brings improvements of nearly ∼10% in the main figures of merit.

Ablation experiments
The following ablation studies aim to demonstrate, empirically, the motivation for the proposed models. First, we provide quantitative evidence of the better performance of global constraints (model in Eq. 5) over pixel-level constraints (i.e., Venkataramanan et al. (2020)). Second, we show that resorting to the extended log-barrier function is a better alternative than standard L2 penalty functions. Then, we perform an in-depth analysis of the optimal hyperparameter values for the entropy-guided model (Eq. 8), as well as other important design choices.

Image vs. pixel-level constraint. The following experiment demonstrates the benefits of imposing the constraint on the whole image rather than in a pixel-wise manner, as in Venkataramanan et al. (2020). In particular, we compare the two strategies when the constraint is enforced via an L2 penalty function, whose results are presented in Table 2. We can see that imposing the constraint at image level consistently outperforms pixel-level constraints. These results support our hypothesis that global constraints, such as the proposed formulation in Eq. 5, should be preferred over multiple pixel-wise constraints like those in Venkataramanan et al. (2020).

Extended log-barrier vs. penalty-based functions. To motivate the choice of the extended log-barrier over standard penalty-based functions in the constrained optimization problem in Eq. 3, we compare them in Table 2. It can be observed that imposing the constraint with the extended log-barrier consistently outperforms the L2 penalty, with substantial performance gains.
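The image-level vs. pixel-level distinction can be made concrete with a small sketch. This is our own illustrative NumPy code, not the paper's implementation; both variants enforce "full attention" on normal images via an L2 penalty, while the actual proposed model uses an inequality constraint (Eq. 5) rather than an equality penalty.

```python
import numpy as np

def pixel_level_penalty(cam):
    """One L2 penalty per pixel, each pushing its activation towards 1
    (the strategy of Venkataramanan et al. (2020))."""
    return np.mean((cam - 1.0) ** 2)

def image_level_penalty(cam, target=1.0):
    """A single global L2 penalty on the mean activation of the whole map."""
    return (cam.mean() - target) ** 2
```

A map that perfectly covers half the image still incurs a large pixel-level penalty on the missing half, whereas the global penalty only sees the aggregate, which leaves the optimizer more freedom in how the attention is distributed.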
On the impact of entropy-guided constraints. We now perform an in-depth analysis of the effect of integrating the entropy-guided constraint in Eq. 8 for anomaly localization, as well as an extensive validation of the values of the balancing terms β and λ_H. First, we study the impact of L_H across different β values (i.e., β = {0.01, 0.1, 1, 10}), fixing its balancing term λ_H to 0.1, a value that empirically showed good stability. These results, reported in Figure 3a, show that the VAE with and without the entropy constraint presents different optimal values for β. Nevertheless, the best results are obtained when the contribution of the regularization term is large (i.e., β ≥ 1) and the entropy-based regularization over the activation maps is included (i.e., green bars). Furthermore, this configuration is more stable once a large β weight is set, particularly for the constrained formulation. Then, based on the best configuration (β = 10), we study how different λ_H weights {0.01, 0.1, 1, 10} impact model performance. These results (Figure 3b) show that incorporating the entropy regularization always contributes to performance gains, with an optimal weight value of λ_H = 0.1.
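An entropy term of this kind can be sketched as follows. This is our own NumPy illustration of maximizing the Shannon entropy of softmax-normalised activations; the paper's L_H (Eq. 8) is defined over the brain-tissue pixels Ω_B rather than the full map.

```python
import numpy as np

def entropy_loss(activations):
    """Negative Shannon entropy of the softmax-normalised activation map.
    Minimising this term (i.e., maximising H) pushes the activations
    towards a homogeneous, maximum-entropy distribution."""
    a = activations.ravel().astype(float)
    a = a - a.max()                       # numerical stability for exp
    p = np.exp(a) / np.exp(a).sum()       # softmax over all pixels
    h = -(p * np.log(p + 1e-12)).sum()    # Shannon entropy
    return -h                             # maximise H by minimising -H
```

A constant (homogeneous) map attains the minimum, -log(N) for N pixels, while a peaked map is penalised, which matches the homogeneous activation distributions reported above for the constrained formulation.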
In the next experiment, we show how adding the L_H term to our formulation impacts the activation maps (AMs). Concretely, we first show in Figure 4 the AM distribution for a normal sample under both the constrained and unconstrained configurations. It can be observed that, in our constrained formulation, the distribution of activation values is more homogeneous (in orange), unlike the more spread values found in its unconstrained counterpart (in green). Furthermore, we show its impact on unseen, anomalous samples, where the benefits of our model are better highlighted. In particular, we represent the AM distribution for normal and anomalous pixels in the unconstrained formulation (i.e., λ_H = 0) in Figure 5 (top), and the effect of integrating the L_H term (Figure 5, bottom). As with the normal samples, the distribution of normal pixels produced by the unconstrained setting spreads over a larger range, resulting in a higher overlap with the distribution of anomalous pixels. Note that, beyond the overlapping regions, some normal pixel values even exceed anomalous ones. In contrast, the more compact distribution provided by the proposed formulation favors a smaller overlap between normal and anomalous pixel intensity distributions, which makes discriminating normal from anomalous pixels easier.
In the following, we explore how the entropy constraint yields the smallest overlap between the normal and anomalous distributions of the objective criterion, compared to previous literature. To do so, we depict in Figure 6 the distributions of both populations for the proposed methods, AMCons and GradCAMCons, and the most promising baselines, F-anoGAN and HistEq. Furthermore, we quantify the overlap between both distributions by dividing the number of samples in the overlapping region of the histograms by the total number of samples. The proposed method based on entropy maximization obtains the smallest overlap (10.2%) and produces a narrower distribution of normal samples in comparison with the GradCAMCons method, which is based on size constraints.
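The overlap figure can be computed along these lines. This is our own reading of the description above; the bin count and the exact definition of the "overlapped region" (bins where both populations have mass) are assumptions, not the paper's code.

```python
import numpy as np

def histogram_overlap(normal_scores, anomalous_scores, bins=100):
    """Share of all samples falling in bins where the histograms of the
    normal and anomalous populations both have mass."""
    lo = min(normal_scores.min(), anomalous_scores.min())
    hi = max(normal_scores.max(), anomalous_scores.max())
    h_n, _ = np.histogram(normal_scores, bins=bins, range=(lo, hi))
    h_a, _ = np.histogram(anomalous_scores, bins=bins, range=(lo, hi))
    both = (h_n > 0) & (h_a > 0)          # bins shared by both populations
    total = len(normal_scores) + len(anomalous_scores)
    return (h_n[both].sum() + h_a[both].sum()) / total
```

Under this reading, fully disjoint distributions give 0 and identical distributions give 1, so a lower value means an easier separation of normal from anomalous pixels.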
Using statistics from the normal domain for the anomaly localization threshold. A common practice in unsupervised anomaly segmentation is to use anomalous images to define the threshold that yields the final segmentation masks. In particular, these methods look at the AUPRC on the anomalous images, which is then used to compute the optimal threshold value. We refer to this technique in our experiments as OP (Operative Point). To alleviate the need for anomalous samples during the validation stage, several methods (Baur et al. (2019)) have discussed the possibility of using a given percentile of the normal-image (i.e., no anomalies) distribution to set the threshold. Motivated by this, an ablation study on the percentile value is presented in Table 3 for our proposed formulations and the best performing baselines. First, we can observe that under the OP strategy (i.e., accessing anomalous images to identify the optimal threshold), both of our models bring substantial improvements over the state-of-the-art residual-based approaches, ranging from 14% to 22%. If we resort to percentiles instead, the performance improvements observed are very similar to the OP scenario, with our models outperforming F-anoGAN by a large margin. Nevertheless, we observed that the best results are obtained with different percentile values. While F-anoGAN and AMCons w. L_H yield the best performance using the 98th percentile, GradCAMCons w. L_S follows previous observations in Baur et al. (2019), performing better with the 95th percentile. This suggests that, even though not used directly, anomalous images are still required to find the optimal threshold value. However, the proposed GradCAMCons method shows special properties suggesting that it can achieve large performance gains without having access to anomalous images to define the threshold, unlike prior works.
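The percentile strategy amounts to deriving the threshold from normal data alone, e.g. (a sketch with hypothetical variable names; the percentile value is the hyperparameter studied in Table 3):

```python
import numpy as np

def percentile_threshold(normal_scores, percentile=98.0):
    """Threshold set from the distribution of anomaly scores on NORMAL
    data only, avoiding any access to anomalous images."""
    return np.percentile(normal_scores, percentile)

def segment(anomaly_map, threshold):
    """Binary anomaly mask from a continuous anomaly map."""
    return anomaly_map >= threshold
```

For example, one would collect the per-pixel scores on a normal validation set, take their 98th percentile, and apply that single threshold to every test map.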
In particular, our GradCAM-based formulation restricts the attention values to [0, 1], which allows setting a typical threshold of 0.5 while still providing large performance gains (+7%) compared to the baselines. In contrast, if we resort to the percentile strategy, our method based on maximizing the entropy of the attention maps (i.e., AMCons) is very sensitive to the selected value.
Number of slices to generate the pseudo-volumes. In our experiments, we followed the standard literature (Baur et al. (2021)) to generate the pseudo-volumes for validation and testing. Nevertheless, we concede that this scenario is unrealistic, since the appropriate number of slices to use from the MRI scans should be unknown in unsupervised anomaly detection. We now explore the impact of including more slices in these pseudo-volumes, which increases the variability of normal samples. For instance, it is well known that the target regions in slices farther from the center are incrementally smaller. In this line, we hypothesize that the dimension of the VAE latent space and the importance of the KL regularization may be determining factors in absorbing this increased variability. Regarding the latent space, the appropriate z dimension is unclear in the literature. For instance, Baur et al. (2021) uses z = 128, while Baur et al. (2019) uses z = 64, and we obtained better results using z = 32. To validate the proposed experimental setting and latent space dimension, we present results using an increasing number of slices around the axial midline, N = {10, 20, 40}, and two different latent space dimensions, z = {32, 128}, for both a standard VAE and our proposed models, in Figure 7a. We can observe that, although the gap between the baselines and the attention-based methods is reduced as the number of slices increases, this difference remains significant, and the relative performance drop is similar for all methods. Finally, we can observe that increasing the z dimension (solid versus dotted lines in Fig. 7a) does not produce performance gains in any case. Note that the model hyperparameters used are optimized for z = 32 and N = 10, which could produce some underestimation of the proposed model's performance when N increases. Next, we study the performance of the proposed AMCons method using different β values (β = {1, 10}) in the KL term of Eq. 1 across different numbers of slices, whose results are presented in Figure 7b. We can observe that, by decreasing the value of β as the number of employed slices increases, we can alleviate the performance degradation observed with a fixed β. Since the KL regularization directly affects the capacity of the VAE to learn different samples, optimizing its balancing term when the domain of samples grows seems necessary. The similar behaviour of the proposed method and the baselines suggests that this could be a limitation of self-trained features based on VAEs, which struggle to encode heterogeneous sample information.
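For reference, the β-weighted objective discussed here has the usual form L = L_rec + β·KL(q(z|x) || N(0, I)). A minimal NumPy sketch of the closed-form KL term for a diagonal Gaussian posterior follows; this is our own illustration, not the paper's code.

```python
import numpy as np

def beta_vae_objective(recon_loss, mu, logvar, beta=10.0):
    """VAE objective with a beta-weighted KL term. The closed-form KL
    between N(mu, diag(exp(logvar))) and N(0, I) is
    0.5 * sum(mu^2 + exp(logvar) - logvar - 1)."""
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)
    return recon_loss + beta * kl
```

Lowering β weakens the pull towards the prior, which is the knob used above to let the latent space absorb the extra variability of larger pseudo-volumes.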

Generalization to other datasets
In order to empirically demonstrate the generalization properties of the proposed methodology, we evaluate its performance on a different brain lesion detection dataset. Concretely, as previously described, we resort to the Physionet-ICH dataset of non-contrast CT for ICH localization. Implementation details are analogous to those used on the BraTS dataset, although we decreased the learning rate to 1e-5 and set a larger latent dimension, i.e., z ∈ R^128, across all baselines and methods to favour model convergence. The results obtained for anomaly localization are reported in Table 4. Even though there exist slight differences among the residual methods compared to the results obtained on the BraTS dataset (i.e., the simple AE outperforms the variational approaches), the proposed attention-based anomaly localization methods still achieve remarkable results. Again, the AMCons configuration yields the best performance, reaching improvements of nearly ∼25% and ∼18% in terms of AUPRC and DICE, respectively, compared to previous literature. These results suggest that the proposed methodology is able to generalize to other unsupervised brain lesion segmentation challenges, even across different imaging modalities. It should be noted, however, that the absolute segmentation results are lower than those obtained on BraTS. Among other reasons, this may be due to the greater heterogeneity of the ICH dataset, the lower degree of standardization and smaller size of the database used, and the small size of ICH lesions, which penalizes metrics such as DICE. Nevertheless, the values obtained are in line with the scarce previous literature on ICH segmentation, as reflected in Table 4. Indeed, the obtained results are on par with previous works using a fully supervised learning approach (Hssayeni et al. (2020b)), which shows the difficulty of the task.

Qualitative evaluation
Visual results of the proposed and existing methods on both datasets are depicted in Figure 8. We can observe that our approach identifies more complete regions of the lesions as anomalous, whereas existing methods are prone to produce a significant amount of false positives (first, third and seventh rows) and fail to discover many abnormal pixels (third row). These visual results are in line with the quantitative validation performed in previous sections. However, there is a known problem of segmenting only hyperintense regions in state-of-the-art methods for unsupervised anomaly localization of brain lesions (Meissen et al. (2021)). Although the proposed method still suffers from this limitation (fourth row, red arrow), its correct rejection of some normal, hyperintense tissue (second row, green arrow) suggests an improvement in relation to this problem.

Discussion
Despite the recent advances of unsupervised anomaly segmentation in medical problems, existing literature still provides limited performance, with most methods yielding suboptimal results in popular segmentation benchmarks. In this work, we have presented a novel approach that substantially differs from prior literature in several aspects.
First, we resort to generated attention maps to identify anomalous regions, which contrasts with most existing works that rely on the pixel-wise reconstruction error. Second, our formulation integrates a size-constrained loss that enforces the attention maps to cover the whole image for normal images. This differs from very recent works (Venkataramanan et al. (2020)), as we tackle this problem by imposing inequality constraints on whole target attention maps. Another important difference lies in the manner the constrained problem is addressed. While Venkataramanan et al. (2020) leverages an L2 penalty function, we resort to an extension of standard log-barrier methods, which overcomes the well-known limitations of penalty-based methods. Quantitative results demonstrate that this model significantly outperforms prior literature on unsupervised lesion segmentation. A drawback of the log-barrier based formulation is that it requires finding optimal values for several hyperparameters. Motivated by this, we have proposed an alternative model, which integrates a regularization term that maximizes the Shannon entropy of the generated attention maps. This new formulation only adds the balancing weight λ_H for the entropy term L_H, which reduces the complexity compared to the constrained problem in Eq. 5. Furthermore, as reported in the results, the maximum-entropy model yields better performance than the size-regularizer formulation. Note, in addition, that the entropy-based model better separates the intensity distributions between normal and abnormal tissue. This allows us to employ a higher percentile value to obtain the final anomalous regions, with a substantial performance improvement compared to previous methods. Thus, based on the reported empirical validation, the proposed models represent a new state of the art for unsupervised anomaly segmentation.
We believe there exist promising research directions to further improve the performance of unsupervised segmentation methods. For example, brain images are typically acquired along multiple modalities. Learning how to combine multiple modalities for anomalous region detection might enhance the representation learned by the VAE, ultimately resulting in better identification of abnormal pixels. In addition, unsupervised segmentation methods have only been evaluated from a discriminative perspective. Nevertheless, assessing their performance in terms of the quality of the uncertainty estimates, i.e., calibration, might give a better overview of the quality of a segmentation model.
1. On the role of gradients in VAEs.
In this section, we describe the empirical analysis of the role of gradients in attention-based anomaly detection using VAEs. To this end, a VAE is trained on normal brain MRI images, and attention maps are extracted for anomalous images. Concretely, we extract Grad-CAMs as defined in Eq. 2, and non-weighted activation maps (AMs) following Eq. 6. A representative case is shown in Figure 2 of the main manuscript. Under the explored setting, VAE Grad-CAMs produce attention maps similar to those obtained from the AMs alone. In particular, we could not find any benefit of gradient weighting other than serving as a scaling factor that keeps the attention maps in the non-saturated range of typically used activation functions, such as the sigmoid operation in Eq. 2. Although Grad-CAMs have been widely used in discriminative models to discern regions of interest using class-specific gradients, their usefulness in generative models such as VAEs seems limited. In this case, the information encoded in the VAE appears closely related to the patterns detected by the convolutional filters in the early layers, without discarding any task-specific information.

Reconstructed images
In addition, we also studied the differences between applying constraints on attention maps using gradients (GradCAMCons setting in Eq. 5) or on activation maps only, in terms of the quality of the reconstructed images. For this purpose, we show in Figure 1a the learning curves of the reconstruction criterion for both methods in their optimal configurations (validated in their respective ablation experiments). In addition, we show the corresponding anomaly localization results in Figure 1b. The setting based solely on activation maps (AMCons) achieves the best localization performance and also yields the lowest reconstruction error. This may be because direct supervision on gradients is too restrictive to optimize the VAE as a whole, compared to the softer criterion of entropy maximization on activation maps. Accordingly, we found that the reconstructed images obtained with the GradCAMCons model have lower quality than those provided by the AMCons formulation. Several examples are depicted in Figure 2.

Figure 1: Study of the gradient influence on image reconstruction. We compare the Grad-CAM based attention constraint (blue) and the activation-map-only regularization in terms of reconstruction losses (a) and pixel-level localization performance (b). Both methods use the best hyperparameters obtained from their respective ablation experiments.

Extraction of brain tissue Ω B
The AMCons formulation proposed in this work constrains the activation maps over brain tissue to be activated homogeneously, following Eq. 8. This training procedure requires separating the brain tissue pixels, Ω_B, from the background. To do so, we apply Otsu's threshold to the image to obtain a binary tissue mask. Then, the mask is processed using a morphological closing operation with a disk-shaped structuring element of size 5 × 5 pixels. This approach robustly separates the background from the foreground, owing to the clear intensity difference between the two regions.
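The brain-mask extraction can be sketched as follows. This is a self-contained NumPy/SciPy version with a minimal hand-rolled Otsu step, assuming intensity images; a library such as scikit-image provides equivalent `threshold_otsu` and `binary_closing` routines.

```python
import numpy as np
from scipy import ndimage

def brain_tissue_mask(image, closing_radius=2):
    """Otsu threshold followed by a morphological closing with a
    disk-shaped structuring element (5x5 for radius 2), as described above."""
    # Otsu: pick the threshold maximizing between-class variance
    hist, bin_edges = np.histogram(image, bins=256)
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    w0 = np.cumsum(hist)                       # background weight
    w1 = w0[-1] - w0                           # foreground weight
    m0 = np.cumsum(hist * centers)
    mu0 = m0 / np.maximum(w0, 1)               # background mean
    mu1 = (m0[-1] - m0) / np.maximum(w1, 1)    # foreground mean
    var_between = w0 * w1 * (mu0 - mu1) ** 2
    t = centers[np.argmax(var_between)]
    mask = image > t
    # Disk-shaped structuring element, then closing to fill small holes
    yy, xx = np.ogrid[-closing_radius:closing_radius + 1,
                      -closing_radius:closing_radius + 1]
    disk = (yy ** 2 + xx ** 2) <= closing_radius ** 2
    return ndimage.binary_closing(mask, structure=disk)
```

The closing fills small dark holes inside the tissue (e.g., hypointense structures) so that Ω_B forms a contiguous region.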

Additional ablation experiments.
In the following we present additional ablation experiments that justify the different hyperparameters used in the proposed methods during the experimental stage.
On the different approaches to size regularization. In order to enforce the attention maps to cover whole normal images during VAE training, Venkataramanan et al. (2020) uses multiple penalties (one per pixel) that force each activation to be maximum. However, it is well known in optimization that a penalty does not act as a barrier near the boundary of the feasible set (Boyd et al. (2004)). In other words, a satisfied constraint results in a null penalty and gradient. Therefore, at a given gradient update, nothing prevents a satisfied constraint from being violated, causing oscillations between competing constraints and ultimately resulting in potentially unstable training. This limitation motivates the methods proposed in this work, which use a single constraint per image based on the average activation. In addition, we advocate for the log-barrier extension over penalties, since its strictly positive gradient grows as a satisfied constraint approaches violation during optimization. The improved training dynamics are illustrated in Figure 3, which shows that the pixel-level penalty-based method (green line) exhibits a more unstable convergence.

On the impact of the reconstruction losses. We evaluate the effect of including several well-known reconstruction losses in our formulation: SSIM and the L2-norm. Table 1 reports the results of these experiments, where we can observe that, while the BCE and SSIM reconstruction losses yield the best performance, integrating the L2-norm loss in our formulation degrades the performance of the proposed model.

GradCAMCons setting optimization.
To better understand the behaviour of the attention constraints in the proposed setting using Grad-CAMs and the attention expansion constraint (L_S), we resort to extensive ablation experiments to determine the optimal values of several model hyperparameters: the log-barrier term t, the weight of the attention loss during training, λ_s, and, finally, the network depth used to compute the CAMs. First, we empirically fix λ_s = 10^3 and β = 1, use the first convolutional block output to compute the CAMs, and evaluate the impact of t values in {1, 10, 25, 50}. These results show that the use of log-barrier extension constraints favours model optimization and thus anomaly localization performance (see Table 3 in the main paper). Nevertheless, this configuration requires empirically fixing more parameters (i.e., t in Eq. 4) than the formulation using an L2 penalty. To alleviate this issue, we explore using a predefined scheduler that incrementally increases the slope t during training. Concretely, the t value is scheduled as t = 1.01^e, with e being the training epoch. The results presented in Table 2 show a slight decrease compared with the best fixed configuration (t = 10). Nevertheless, the obtained results still outperform the penalty-based methods (see Table 3 in the main paper), as well as the other baselines, by a large margin.
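The extended log-barrier and the t scheduler above can be sketched as follows. This is based on the log-barrier extension of Kervadec et al. for an inequality constraint z ≤ 0; the function and variable names are ours, and we read the scheduler as t = 1.01^e.

```python
import numpy as np

def log_barrier_extension(z, t):
    """Extended log-barrier for a constraint z <= 0: the standard barrier
    -log(-z)/t for strictly feasible points, continued by a linear branch
    so the penalty is defined everywhere, with a strictly positive gradient
    that grows as a satisfied constraint approaches violation."""
    if z <= -1.0 / t ** 2:
        return -np.log(-z) / t
    return t * z - np.log(1.0 / t ** 2) / t + 1.0 / t

def t_schedule(epoch, base=1.0, rate=1.01):
    """Scheduler explored in the ablation: t grows geometrically with the epoch."""
    return base * rate ** epoch
```

The two branches meet at z = -1/t², so the function stays smooth; raising t over epochs makes the barrier progressively sharper without hand-tuning a fixed value.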
We now validate the encoder depth used to obtain the CAMs (i.e., network depth s in Section 3.2), with the best configuration from the previous ablation in Table 2. Results are presented in Table 3, from which we can observe that maximizing the attention in early layers leads to better results than in deeper layers. This may be due to the better spatial definition of early layers, and to the benefits that the proposed constraint produces in later layers, which receive information from the whole image. The experiments presented in the main paper use the best configuration: t = 10, β = 1 and λ_s = 10^3, with CAMs obtained from the first convolutional block.
Using size constraints on AMCons. The use of Grad-CAMs for unsupervised anomaly segmentation is supported by the claim that gradients from the latent space allow discriminating between normal and anomalous regions, as discussed in prior literature (Venkataramanan et al. (2020); Liu et al. (2020)). Nevertheless, we empirically found that, in the VAE, gradients are highly correlated with the intermediate activations in the encoder, without any discriminating function (see Section 3.4 and Section 1 of the Supplemental Material). Therefore, we propose to use only the activation maps in the constrained formulation, in order not to force the gradients during the VAE optimization (AMCons method). In this context, using a sigmoid activation (which saturates the activation values) to enforce a size supervision that maximizes the activations has certain drawbacks. For example, the magnitude of the activation maps depends on the architecture used, with higher values typically found in deeper layers. As a direct consequence, in the absence of gradient-based scaling, the activations may already lie in the saturated zone of the sigmoid. Furthermore, artificially increasing the activation values can move the generative model away from its stable configuration, damaging the encoding and reconstruction tasks. For these reasons, in the AMCons configuration we use a softmax activation, which normalizes the activations relative to the whole set of pixels, smoothing the applied supervision while not forcing the activation values to settle around any particular value. The drawbacks observed for the size-constrained configuration are confirmed by the empirical results, which are much worse than those of the proposed entropy constraint, L_H, as shown in Table 4.

Table 4: Ablation study on the use of size constraints (L_s) in the activation-map-based configuration, AMCons.

Model complexity.
In this section, we compare our formulation to existing approaches in terms of model complexity. Since previous residual-based methods require generating normal counterparts from anomalous images, they typically integrate an additional discriminator to create more realistic images, and require the trained generative decoder during inference. In contrast, an interesting property of CAM-based anomaly detection is that it does not require a decoder during the inference stage. As indicated in Table 5, the proposed methods require less computational workload during inference. This is accentuated for the AMCons method, since it does not need gradients computed from the latent representation, but only intermediate activation maps from the encoder. Moreover, during training, the cost of adding a single constraint is negligible, as pointed out in previous literature on constrained optimization (Kervadec et al. (2019b)).

Table 5: Parameters of the proposed method and the best performing baselines during both training and inference stages.

Additional qualitative visualizations
In the following Figure 4 and Figure