Weakly Supervised Object Detection with 2D and 3D Regression Neural Networks


INTRODUCTION
Attention maps can be computed to reveal discriminative areas for the predictions of convolutional neural networks. Methods that compute such attention maps were originally designed to make neural networks more explainable [1], [2], [3], [4]. As these methods do not require image annotations for the optimization of the networks, but only global labels, they can also be used for weakly supervised detection.
Weakly supervised methods are especially promising for a large number of medical image analysis problems. Since medical expertise is scarce and annotation time expensive, unsupervised [5] and weakly supervised methods [6], [7] are well suited to extracting information from large medical databases, in which labels are often either sparse or nonexistent.
We propose a novel weakly supervised detection method, using an encoder-decoder network optimized with global labels. Combining the last feature maps of such an architecture enables the computation of attention maps at full input resolution, so that small structures can be detected more accurately. In conventional approaches [8], [9], due to pooling layers in the network architecture, the feature maps used for the computation of attention maps are often downsampled with respect to the input image. Common practice is then to upsample the attention maps during postprocessing, which can result in a loss of high-resolution information.

E-mail: marleen.debruijne@erasmusmc.nl

In this article, we focus on weak supervision with regression neural networks for counting. Regression networks have widely been optimized with local labels such as voxel coordinates [10], distance maps [11], [12] or depth maps [13]. Less frequently, regression networks have been used to predict global labels, such as pedestrian count [14], age [15], or brain lesion load [16]. In this article, we optimize regression networks with global labels, namely the number of target objects present in the images, but use this as a means for detection: attention maps are computed during inference to reveal the location of the target objects. Segui et al. [14] used a similar technique, but did not quantify the detection performance of this approach.

CAM methods
This category consists of variants of the class activation map (CAM) method proposed by Zhou et al. [8]. CAMs are computed from the last feature maps of the network. During network optimization, these feature maps are followed by a global pooling layer, and usually one or more fully connected layers to connect to the output of the network. CAMs are computed during inference as a linear combination of these last feature maps, weighted by the parameters of the fully connected layers learnt during training. If the last feature maps have a much lower resolution than the input, as is the case in deep networks with multiple pooling layers, the resulting attention maps can be very coarse. This is suboptimal when small objects need to be localized, or when contours need to be segmented precisely. To alleviate this issue, Dubost et al. [16] and Schlemper et al. [19] proposed to include high-resolution feature maps in the computation of the attention maps. Dubost et al. [16] combined feature maps of different resolutions via upsampling, skip connections and concatenation, similarly to U-Net [20], while Schlemper et al. [19] used gated attention mechanisms, which rely on the implicit computation of internal attention maps. Selvaraju et al. [9] proposed to generalize CAM to any network architecture, combining feature maps using weights computed with the derivative of the output with respect to these feature maps. Unlike other CAM methods, the method by Selvaraju et al. [9] does not require the presence of a global pooling layer in the network, and can be computed for any layer of the network. These methods are detailed in the method section.

arXiv:1906.01891v2 [cs.CV] 14 Jun 2019

Gradient methods
Simonyan et al. [17] proposed to compute attention maps using the derivative of a classification network's output with respect to the input image. These attention maps are fine-grained, but often noisy. Springenberg et al. [18] reduced this noise by masking the values corresponding to negative entries of the gradient signal in the ReLU activations. Gradient methods can be applied to any CNN.

Perturbation methods
Perturbation methods compute attention maps by applying random perturbations to the input and observing the changes in the output. These methods are model-agnostic: they can be used with any prediction model, not necessarily a neural network. One of the simplest and most effective implementations of such methods was recently proposed by Petsiuk et al. [21] with masking perturbations. The input is masked with a series of random smooth masks before being passed to the network. Attention maps are computed as a linear combination of these masks weighted by the updated network classification scores. This method relies on a mask sampling technique, where the masks are first sampled in a lower dimensional space and then scaled to the size of the full image. Earlier, Fong et al. [22] proposed several other perturbation techniques, including replacing a region with a constant value, injecting noise, and blurring the image.
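The masking idea described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the implementation of Petsiuk et al. [21]: the masks here are blocky (nearest-neighbour upscaling of a low-resolution binary grid) rather than smooth, and `toy_model`, `cells` and `keep_prob` are hypothetical names introduced for the example.

```python
import numpy as np

def perturbation_attention(image, model, n_masks, cells, rng, keep_prob=0.5):
    """Average of random masks, each weighted by the model score on the masked input.

    Assumes the image height and width are divisible by `cells`.
    """
    h, w = image.shape
    scale = (h // cells, w // cells)
    attention = np.zeros_like(image, dtype=float)
    for _ in range(n_masks):
        # Sample the mask in a low-dimensional space, then scale it to image size.
        low_res = (rng.random((cells, cells)) < keep_prob).astype(float)
        mask = np.kron(low_res, np.ones(scale))  # nearest-neighbour upscaling
        attention += model(image * mask) * mask
    return attention / n_masks

# Toy setup: the "object" is the bright top-left quadrant, and the stand-in
# model scores an image by its mean intensity in that quadrant.
image = np.zeros((8, 8))
image[:4, :4] = 1.0
toy_model = lambda x: float(x[:4, :4].mean())
rng = np.random.default_rng(0)
attention = perturbation_attention(image, toy_model, n_masks=200, cells=2, rng=rng)
```

In this toy case the score is non-zero only when the top-left quadrant is kept, so that quadrant accumulates at least as much weight as any other region.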

Other methods
Other weakly supervised detection methods have been proposed, relying for instance on latent support vector machines (SVMs) [23], on a reformulation of the multiple instance learning mi-SVMs [24], or more recently, on multiple instance learning with attention-based neural networks trained with bags of image patches [25], or on iterative learning with neural network classifiers, where the training set is made of subsets of the most reliable bounding boxes from the last iteration (Sangineto et al. [26]).

Contributions
We propose a novel weakly supervised detection method. The principle of the method is to use an encoder-decoder segmentation architecture to compute attention maps at full input resolution, to help the detection of small objects. Preliminary work on this idea was published in [16]. In the current work, skip connections have been added to the network architecture, and the evaluation is substantially more extensive. The proposed method is compared to state-of-the-art attention map methods [9], [17], [18], [19] in MNIST-based detection datasets, and in the 3D detection of enlarged perivascular spaces (PVS), a type of brain lesion that is associated with cerebral small vessel disease [brown2018]. Quantifying and detecting PVS is challenging both visually and automatically, because PVS are very small (at the limit of the scan resolution), often occur in high counts, and can easily be confused with multiple other types of lesions [27], [28], [29], [30].
All state-of-the-art attention map computation methods mentioned in section 1.1 [9], [17], [18], [19] and most weakly supervised detection neural networks [26], [31], [32], [33], [34], [35] were originally proposed to be optimized with a global classification objective. In this work we instead use a global regression objective, and show that, when counting labels are available, this is better suited to the weakly supervised detection of objects appearing multiple times in an image.
The current work is also the only study to date to evaluate automated PVS detection in such a large dataset (four regions and 2202 scans) using center locations of PVS.

METHODS
We implemented multiple state-of-the-art methods for weakly supervised detection with CNNs: (a) GP-Unet (this article), (b) GP-Unet no residual [16], the first proposed version of GP-Unet, (c) Gated Attention [19], (d) Grad-CAM [9], (e) Grad [17], and (f) Guided backpropagation [18]. These methods were selected either because they became standard for the computation of attention maps, or because they were evaluated in medical datasets. For all methods, the CNNs are designed to output a single scalar ŷ ∈ R and are trained to minimize the mean squared error between ŷ and the number of occurrences of target objects y ∈ N. Then, for a given input image I, the attention map M is computed at inference time. Below, we detail the computation of these attention maps for each method.

CAM methods
The principle of all CAM methods is to use the feature maps (or activation maps) of the network to compute attention maps. CAM methods usually exploit the feature maps of the last convolutional layer of the network, as these are expected to be more closely related to the target. The attention map $M_{\mathrm{CAM}}$ is then computed as a linear combination of the feature maps $f_k$ (before global pooling) using the weights $w_k$ of the fully connected layer:

$$M_{\mathrm{CAM}} = \sum_k w_k f_k \tag{1}$$

The computation of CAM attention maps is illustrated in Figure 1.
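The linear combination of feature maps described above can be illustrated with a small NumPy sketch. The array shapes, the toy values and the function name `class_activation_map` are assumptions made for the example, and the bias of the fully connected layer is ignored.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights):
    """Weighted sum of feature maps: M = sum_k w_k * f_k.

    feature_maps: array of shape (K, H, W), the maps before global pooling.
    fc_weights:   array of shape (K,), weights of the fully connected layer.
    """
    return np.tensordot(fc_weights, feature_maps, axes=1)

# Toy example with K = 2 feature maps of size 2 x 2 (illustrative values).
f = np.array([[[1.0, 0.0], [0.0, 0.0]],
              [[0.0, 0.0], [0.0, 2.0]]])
w = np.array([3.0, -1.0])
cam = class_activation_map(f, w)

# With global average pooling and no bias, the network output is
# y_hat = sum_k w_k * mean(f_k), which equals the mean of the attention map.
y_hat = float(np.dot(w, f.mean(axis=(1, 2))))
```

Under these assumptions the attention map can be read as a spatial decomposition of the predicted count: averaging the map over all pixels gives back the network output.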

GP-Unet.
In the approach by Zhou et al. [8], the attention map is computed from the last feature maps of the network, which are often downsampled with respect to the input image due to pooling layers in the network. This leads to low-resolution attention maps. To alleviate this problem, we use the CAM principle with the architecture of a segmentation network (U-net from Ronneberger et al. [20]), i.e. with an upsampling path (decoder), where the feature maps $f_k$ of the last convolution layer, before global pooling (GP), have the same size as the input image I (see architectures in Figure 2 and section 2.2). The attention maps are still computed with Equation 1.

GP-Unet no residual.
In our earlier work, we proposed another version of GP-Unet [16] based on a deeper architecture without residual connections (see architectures in Figure 2 and section 2.2). Experiments showed that such a deep architecture was not needed and could slow the optimization. We refer to this approach as GP-Unet no residual in the rest of the paper. To detect hyperintense brain lesions in MRI data, Dubost et al. [16] also rescaled the attention map values to [0, 1] and summed them pixel-wise with rescaled image intensities. We implemented this postprocessing step in GP-Unet no residual for the detection of brain lesions. We did not implement this in the new version of GP-Unet because we suspected that residual connections between the input and output of two successive convolutional layers could allow the network to better learn this operation.

Gated Attention.
Similar to GP-Unet, Gated Attention [19] computes attention maps at the resolution of the input image. While in GP-Unet we proposed to upsample and concatenate feature maps of different scales [16], as advised for segmentation networks by Ronneberger et al. [20], Schlemper et al. [19] proposed instead a more complex gated attention mechanism to combine information from different scales. This gated attention mechanism relies on attention units, also called attention gates, that compute soft attention maps and use these maps to mask irrelevant information in the feature maps. In addition to the gated attention mechanism, global pooling is applied at every scale s and the results are directly linked to the output by a fully connected layer aggregating information across scales. Schlemper et al. [19] proposed three aggregation strategies with only small differences in results. For the sake of simplicity, we employed the concatenation strategy in our experiments. See Figure 2 for an illustration of the architectures of Gated Attention and of GP-Unet. The attention maps $M_{\mathrm{Gated}}$ of the gated attention mechanism method are computed as:

$$M_{\mathrm{Gated}} = \sum_s \sum_k w_k^s f_k^s$$

where $w_k^s$ are the weights of the last fully connected layer for the neurons computed from the feature maps $f_k^s$ at scale s.

Grad-CAM.
Finally, Grad-CAM [9] is a generalization of CAM [8] to any network architecture. The computation of the attention map is similar to Equation 1, but uses different weights $\alpha_k$ in the linear combination. The weights $\alpha_k$ are computed with the backpropagation algorithm. With this technique the global pooling layer is no longer needed, and attention maps can be computed from any layer in any network architecture. More precisely, each weight $\alpha_k$ is computed as the average over all voxels of the derivative of the output ŷ with respect to the feature maps $f_k$ of the target convolution layer. In our case, we use the feature maps of the last convolution layer preceding global pooling, and the weights are computed as:

$$\alpha_k = \frac{1}{Z} \sum_i \frac{\partial \hat{y}}{\partial f_{k,i}}$$

where Z is the number of voxels in the feature map $f_k$ and $f_{k,i}$ is the value of $f_k$ at voxel i. The attention map $M_{\text{Grad-CAM}}$ is then computed as a linear combination of the feature maps weighted by the $\alpha_k$:

$$M_{\text{Grad-CAM}} = \sum_k \alpha_k f_k$$

and is upsampled with linear interpolation to compensate for the max-pooling layers. In their original work, Selvaraju et al. [9] proposed to compute attention maps from any chosen layer in the network, to generate multiple explanations for the network's behavior. We used the feature maps $f_k$ of the last convolution layer.
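As a small worked case, consider a network that ends with global average pooling followed by one fully connected layer, so that ŷ = Σ_k w_k · mean(f_k) (bias ignored). The derivative of ŷ with respect to every voxel of f_k is then w_k / Z, so the Grad-CAM weights reduce to w_k / Z and the Grad-CAM map equals the CAM map scaled by 1/Z. The sketch below checks this numerically; the gradients are written analytically here as an assumption of the example (in practice they would come from automatic differentiation), and the final upsampling step is omitted.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """alpha_k = mean over voxels of d y_hat / d f_k; map = sum_k alpha_k * f_k."""
    spatial_axes = tuple(range(1, feature_maps.ndim))
    alphas = gradients.mean(axis=spatial_axes)
    return np.tensordot(alphas, feature_maps, axes=1), alphas

rng = np.random.default_rng(0)
K, H, W = 3, 4, 4
f = rng.random((K, H, W))            # toy feature maps
w = np.array([0.5, -1.0, 2.0])       # toy fully connected weights
Z = H * W

# Analytic gradients of y_hat = sum_k w_k * mean(f_k): w_k / Z at every voxel.
grads = np.broadcast_to((w / Z)[:, None, None], f.shape)

m_grad_cam, alphas = grad_cam(f, grads)
m_cam = np.tensordot(w, f, axes=1)   # CAM map for comparison
```

Because the function averages over all axes after the first, the same code applies unchanged to 3D feature maps of shape (K, D, H, W).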

Gradient methods
Grad.
Simonyan et al. [17] first proposed to compute attention maps by estimating the gradient of the output with respect to the input image. Gradients are computed with the backpropagation algorithm. This method highlights pixels for which a small change would affect the prediction ŷ by a large amount. The attention map $M_{\mathrm{Grad}}$ is computed as:

$$M_{\mathrm{Grad}} = \frac{\partial \hat{y}}{\partial I}$$

Guided backpropagation.
The attention maps obtained by Grad can highlight fine detail in the input image, but often display noise patterns. This noise mostly results from negative gradients flowing back in the rectified linear unit (ReLU) activations. These negative gradients are believed to interact with positive gradients according to an interference phenomenon [36]. With the standard backpropagation algorithm, during the backward pass, ReLU sets to zero the gradients corresponding to negative entries of the bottom data (indices corresponding to negative values in the feature maps that precede the ReLU, which come from the input to the CNN), but not those that have a negative value in the top layer (negative gradients). Springenberg et al. [18] proposed to additionally mask out the values corresponding to negative entries of the top gradient in the ReLU activations. This is motivated by the deconvolution approach [37], which can be seen as a backward pass through the CNN where the information passes in reverse direction through the ReLU activations [17], [18]. Masking out these negative entries from the top layer effectively clears the noise in the attention maps.
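The difference between the two backward rules at a single ReLU can be sketched in a few lines. The function names and the toy values are illustrative assumptions; a full implementation would apply the guided rule at every ReLU of the network during the backward pass.

```python
import numpy as np

def relu_backward(bottom, top_grad):
    """Standard backpropagation: block gradients where the ReLU input was <= 0."""
    return top_grad * (bottom > 0)

def relu_backward_guided(bottom, top_grad):
    """Guided backpropagation: additionally block negative incoming gradients."""
    return top_grad * (bottom > 0) * (top_grad > 0)

bottom = np.array([-1.0, 2.0, 3.0, -4.0])    # inputs to the ReLU (forward pass)
top_grad = np.array([0.5, -0.7, 0.9, -0.2])  # gradients arriving from the layer above

standard = relu_backward(bottom, top_grad)        # keeps the -0.7 entry
guided = relu_backward_guided(bottom, top_grad)   # keeps only the 0.9 entry
```

Only positions where both the forward activation and the incoming gradient are positive survive the guided rule.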

Network architectures
In total, four architectures were implemented to evaluate all six methods. These architectures are illustrated in Figure 2. Grad, Guided backpropagation, and Grad-CAM use the same neural networks (same architecture and weights), but differ in the computation of the attention maps during inference. The other methods require different architectures, and are trained separately. In the following subsections, we detail the components of each architecture in 3D.
We perform experiments with 3D and 2D CNNs. The 3D CNNs use 3D convolutional layers with 3×3×3 filters with zero-padding, and 3D max-pooling layers of size 2×2×2. Similarly, the 2D CNNs use 2D convolutional layers with 3×3 filters with zero-padding, and 2D max-pooling layers of size 2×2. The 2D CNNs always use four times fewer feature maps than their 3D counterparts to allow faster experimentation. After the last convolution layer, each feature map is projected to a single neuron using global average pooling. These neurons are connected with a fully connected layer to a single neuron indicating the output of the network ŷ ∈ R. Rectified linear unit (ReLU) activations are used after each convolution. When using skip connections, feature maps of different layers are concatenated, as proposed by Ronneberger et al. [20].
Because the networks here have only a single output neuron, feature maps could also be combined into a single feature map before applying global pooling. The attention map would then simply be the last feature map of the network, and no combination weight would have to be computed. We instead use the more standard architecture [8], [38], and apply global pooling before combining the information of different feature maps. Each of the last feature maps can then encode a different (type of) feature. This approach has been proposed for multi-class networks and has the advantage of being more general.

GP-Unet architecture (A in Figure 2)
GP-Unet architecture is a small segmentation network, with an encoder and a decoder part. The architecture starts with two convolutional layers with 32 filters each. The output of these two layers is concatenated with the input. Then follows a max-pooling layer and two convolutional layers with 64 filters each. The feature maps preceding and following these two layers are concatenated. In order to combine features at different scales, these low-resolution feature maps are upsampled, concatenated with the feature maps preceding the max-pooling layer, and given to a convolutional layer of 32 filters. Then follows a global average pooling layer, from which a fully connected layer maps to the output. This architecture is simple (fewer than 309 000 parameters for the 3D version), fast to train (less than one day on an Nvidia GeForce GTX 1070 GPU), and allows computing attention maps at the full resolution of the 3D input images.
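The parameter count of the 3D GP-Unet can be checked with a short calculation. The wiring below is one reading of the description above and should be taken as an assumption: a single-channel input, 3×3×3 kernels with bias terms, and concatenations that give 33, 97 and 130 input channels at the indicated layers.

```python
def conv3d_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of a 3D convolution with a cubic kernel."""
    return kernel ** 3 * in_channels * out_channels + out_channels

layers = {
    "conv1 (1 -> 32)": conv3d_params(1, 32),
    "conv2 (32 -> 32)": conv3d_params(32, 32),
    # Concatenation with the input gives 32 + 1 = 33 channels before pooling.
    "conv3 (33 -> 64)": conv3d_params(33, 64),
    "conv4 (64 -> 64)": conv3d_params(64, 64),
    # Concatenating the maps before conv3 and after conv4 gives 33 + 64 = 97
    # channels; after upsampling, adding the 33 pre-pooling channels gives 130.
    "conv5 (130 -> 32)": conv3d_params(130, 32),
    "fully connected (32 -> 1)": 32 + 1,
}
total = sum(layers.values())
```

Under this reading the total is 308 705 parameters, which is consistent with the bound of 309 000 stated above.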

Gated Attention architecture (B in Figure 2)
We adapted the architecture of the Gated Attention network proposed by Schlemper et al. [19] to make it more comparable to the other approaches presented in the current work. Here, the Gated Attention architecture is the same as GP-Unet architecture (A) except for two differences. First, to merge the feature maps between the two different scales, instead of upsampling, concatenation and convolution, we use the attention gate as described by Schlemper et al. [19], whose objective is to compute a soft attention map that masks irrelevant information coming from the high-resolution feature map. Second, in this architecture (B), the downsampled feature maps are not only used to compute upsampled feature maps but are also projected to single neurons with global pooling. The neurons corresponding to the two different scales are then aggregated (using concatenation) and connected to the single output neuron with a single fully connected layer.
The attention gate computes a normalized internal attention map to mask irrelevant information coming from the higher resolution feature maps. In their implementation, Schlemper et al. [19] proposed a custom normalization to prevent the attention map from becoming too sparse. We did not experience such problems and opted for the standard sigmoid normalization.

Base architecture (C in Figure 2)
The network architecture used for Grad, Guided backpropagation, and Grad-CAM is kept as similar as possible to that of GP-Unet for better comparison of methods. It starts with two convolutional layers with 32 filters each. The output of these two layers is concatenated with the input. Then follows a max-pooling layer and two convolutional layers with 64 filters each. The output of these two layers is concatenated with the feature maps following the max-pooling layer, and is given directly to the global average pooling layer. In other words, we apply global pooling to the original image (after max-pooling) and the feature maps after the second convolution at each scale, so on 1+32+64 feature maps. This architecture has shown competitive performance on different types of problems in our experiments (e.g. on brain lesions in [27]). With this architecture, unlike GP-Unet, Grad-CAM produces attention maps at half the resolution of the input image, and could miss small target objects.

GP-Unet no residual architecture (D in Figure 2)
The architecture of GP-Unet no residual was proposed by Dubost et al. [16]. In this work, we only changed the global pooling layer from maximum to average to make comparisons between methods more meaningful. This network is a segmentation network with a downsampling and an upsampling path. The downsampling path has two convolutional layers of 32 filters, a max-pooling layer, two convolutional layers of 64 filters, a max-pooling layer, and one convolutional layer of 128 filters. The upsampling path starts with an upsampling layer, concatenates the upsampled feature maps with the feature maps preceding the max-pooling layer in the downsampling path, computes a convolutional layer with 64 filters, and repeats this complete process for the last scale of feature maps, with a convolutional layer of 32 filters. After that come the global pooling layer and a fully connected layer to a single neuron.
The difference with architecture (A) is that the feature maps are downsampled twice instead of once, and that there are no residual connections between sets of two consecutive convolutions. Consequently, the last convolution layer does not have direct access to the input image intensities. We believe these residual connections make the design of GP-Unet more flexible than this architecture, for instance by allowing the network to directly use the input intensities and locally adjust its predictions. This can be crucial for the correct detection of brain lesions.

EXPERIMENTS
Using the area under the free-response receiver operating characteristic (FROC) curve, the sensitivity, and the average number of false positive detections per scan, we compare our proposed method, GP-Unet, to the other weakly supervised detection methods described in section 2: GP-Unet no residual, Gated Attention, Grad-CAM, Grad, and Guided backpropagation. We use MNIST [39] datasets to compare regression against classification for weakly supervised digit counting. We also compare the performance of the different methods, all using regression objectives, on weakly supervised lesion detection in a large brain MRI dataset.

MNIST Datasets
We construct images as a grid of 7 by 5 randomly sampled MNIST digit images. Examples are shown in Figures 3 and 4. Each digit is uniformly drawn from the set of all training/validation/testing digits, hence with a probability of 0.1 of being a target digit d. To avoid class imbalance, we adapt the dataset to each target digit d by sampling 50% of images with no occurrence of d, and 50% of images with at least one occurrence of d, resulting in ten different datasets.
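A possible way to build such grid images is sketched below. Several details are assumptions made for the example and are not specified in the text: the grid is laid out as 5 rows by 7 columns, the 50/50 balance is obtained by rejection sampling, and random arrays stand in for the actual MNIST digits.

```python
import numpy as np

def make_grid(digit_images, digit_labels, target, rng, rows=5, cols=7):
    """Tile randomly drawn digit images into a grid; return the image and the count."""
    idx = rng.integers(0, len(digit_images), size=rows * cols)
    h, w = digit_images.shape[1:]
    tiles = digit_images[idx].reshape(rows, cols, h, w)
    grid = tiles.transpose(0, 2, 1, 3).reshape(rows * h, cols * w)
    count = int(np.sum(digit_labels[idx] == target))
    return grid, count

def make_balanced_dataset(digit_images, digit_labels, target, n_images, rng):
    """Keep 50% of grids without the target digit and 50% with at least one."""
    need = {False: n_images // 2, True: n_images - n_images // 2}
    grids, counts = [], []
    while need[False] or need[True]:
        grid, count = make_grid(digit_images, digit_labels, target, rng)
        if need[count > 0]:
            need[count > 0] -= 1
            grids.append(grid)
            counts.append(count)
    return np.stack(grids), np.array(counts)

rng = np.random.default_rng(0)
# Placeholder data standing in for MNIST: 200 random 28 x 28 images and labels.
images = rng.random((200, 28, 28))
labels = rng.integers(0, 10, size=200)
X, y = make_balanced_dataset(images, labels, target=4, n_images=10, rng=rng)
```

The count returned for each grid is the weak global label used by the regression objective, while the classification objective would only use whether the count is greater than zero.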

Brain Datasets
Brain MRI was performed within the setting of the population-based Rotterdam Study [40] on a 1.5-Tesla MRI scanner (GE-Healthcare, Milwaukee, WI, USA) with an eight-channel head coil to obtain 3D T2-contrast magnetic resonance scans.The full imaging protocol has been described by Ikram et al. [40].In total, our dataset contains 2202 brain MRI scans.Each scan was acquired from a different subject from the general population typically ranging from 45 to 60 years of age.
An expert rater annotated PVS on each T2 brain MRI scan in four brain regions: in the complete midbrain and hippocampi, and in a single slice in axial view in the basal ganglia (the slice showing the anterior commissure) and the centrum semiovale (the slice 1 cm above the top of the lateral ventricle). The annotation protocol follows the guidelines by Adams et al. [41] for visual scoring of PVS, with the difference that Adams et al. [41] only counted the number of PVS, while in the current work, all PVS have been marked with a dot approximately in their center.

Evaluation objectives versus training objective
In the MNIST datasets, the evaluation objective is to detect all occurrences of a target digit d. During optimization, the training regression objective is to count the number of occurrences of d, while the training classification objective is to detect the presence of at least one occurrence of d.
In the experiments on 3D brain MRI scans, the evaluation objective is to detect enlarged perivascular spaces (PVS) in the four brain regions described in section 3.2. For these datasets we investigate only regression neural networks. These networks are optimized using the number of annotated PVS in the region of interest as the weak global label, as proposed in our earlier work [16]. The locations of PVS are only used for the evaluation of the detection during inference.

MNIST data
We scale the image intensity values in the MNIST grid images between zero and one to ease the learning process.

Brain scans
We first apply the FreeSurfer multi-atlas segmentation algorithm [42] to locate and mask the midbrain, hippocampi, basal ganglia and centrum semiovale in each scan. For each region, we then extract a fixed volume centered on the center of mass of the region. For midbrain (88 × 88 × 11 voxels), hippocampi (168 × 128 × 84 voxels) and basal ganglia (168 × 128 × 84 voxels) these cropped volumes contain the full region. The centrum semiovale is too large to fit in the memory of our GPU (graphics processing unit), so for this region we only extract the slices surrounding the slice that was scored by the expert rater (250 × 290 × 14 voxels). We apply a smooth region mask to set values in other brain regions to zero. Finally, we scale the intensity values between zero and one to ease the learning process. The preprocessing and extraction of brain regions are presented in more detail in previous work [27].

Training of the networks
All regression networks are optimized with Adadelta [43] to minimize the mean squared error between their prediction ŷ ∈ R and the ground truth count y ∈ N. The classification networks in our MNIST experiments were optimized with Adadelta and the binary cross-entropy loss function.
Weights of the convolution filters and fully connected layers are initialized from a Gaussian distribution with zero mean and unit variance, and biases are initialized to zero.
A validation set is used to prevent over-fitting. The optimization is stopped at least 100 epochs after the validation loss stopped decreasing. We select the model with the lowest validation loss. For the MNIST datasets, the models are trained on a set of 500 images (400 for training and 100 for validation). For the brain datasets, the models are trained on a set of 1202 scans (1000 for training and 202 for validation).
During training, we use on-the-fly data augmentation with a random combination of random translations of up to 2 pixels in all directions, random rotations of up to 0.2 radians in all directions, and random flipping in all directions. For the MNIST datasets, the batch size was set to 64. For the brain datasets, because of GPU memory constraints, the networks are trained per sample: each mini-batch contains a single 3D image. As the convergence can be slow in some datasets, we first trained the networks on the smallest and easiest region (midbrain), and fine-tuned the parameters for the other regions, similarly to Dubost et al. [27].
We implemented our algorithms in Python in Keras [44] with TensorFlow as backend, and ran the experiments on an Nvidia GeForce GTX 1070 GPU and an Nvidia Tesla K40 (footnote 1). The average training time was one day.
1. We used computing resources provided by SurfSara at the Dutch Cartesius cluster.

For the classification method, in the first row we notice more false positives than for the regression method. In the second row, the two digits 4 at the top are less highlighted than the other digits 4 in the image. This is not the case for the regression attention map. This observation supports the hypothesis that attention maps computed from classification objectives would focus more on the most obvious occurrence of the target object, instead of equally focusing on all occurrences.

Negative values in attention maps
The meaning of negative values in attention maps is different for CAM methods and gradient methods. For CAM methods, negative values highlight objects in the image whose presence is negatively associated with the target objects. For gradient methods, negative values in the attention map correspond to areas where increasing the intensity would decrease the predicted count (or, equivalently, where decreasing the intensity would increase the predicted count).
For model interpretation, keeping negative values in attention maps seems most appropriate, as it allows the viewer to discover which parts of the image contributed either negatively or positively to the prediction. For detection, the purpose is to find all occurrences of the target object in the image and ignore other objects, including for instance those that correspond to negative values in CAM attention maps. In the literature, two approaches have been proposed to process negative values for object detection: either setting them to zero, or taking the absolute value. CAM methods [9] set negative values of the attention maps to zero to increase detection performance. Gradient methods [17], [18] focus on the magnitude of the derivative and thus compute the absolute value.
In the brain dataset, we aim to solve a detection problem where the target objects are among the highest intensity values in the image (Figures 7 to 10). For gradient methods, this implies that negative values in the attention maps do not indicate the location of the target object in our case. We can therefore ignore negative values, and have decided to set them to zero. For CAM methods, we follow the recommendation of the literature, and also set negative values in attention maps to zero. In the MNIST dataset, we also set negative values to zero for all methods for the sake of simplicity. However, it could be argued that gradient methods may benefit from taking the absolute value.
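The two ways of handling negative values can lead to different top candidates, as the following toy example suggests. The attention values are made up for illustration only.

```python
import numpy as np

attention = np.array([[-2.0, 0.5],
                      [1.5, -0.1]])

zeroed = np.maximum(attention, 0.0)  # negative values set to zero
magnitude = np.abs(attention)        # absolute value of the attention map

# Location of the strongest response under each convention.
top_zeroed = np.unravel_index(np.argmax(zeroed), zeroed.shape)
top_magnitude = np.unravel_index(np.argmax(magnitude), magnitude.shape)
```

With zeroing, the strongest candidate is the pixel with value 1.5, whereas with the absolute value the strongly negative pixel becomes the top candidate.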

Performance evaluation
The outputs of all weakly supervised detection methods presented in Section 2 are attention maps. To obtain the coordinates of the detections, we apply non-maximum suppression on the attention maps using a 2D (for MNIST, centrum semiovale and basal ganglia) or 3D (for hippocampi and midbrain) maximum filter of size 6 voxels, with 8-neighborhood in 2D or 26-neighborhood in 3D. This size corresponds to 3 mm in the axial plane, the maximum size for PVS as defined by Adams et al. [41]; we used the same value for the MNIST datasets. This results in a set of candidates that we order according to their value in the attention map. The candidates with the highest values are considered the most likely to be the target object.
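Non-maximum suppression with a maximum filter can be sketched with `scipy.ndimage.maximum_filter`. This is a minimal sketch, not the exact postprocessing used in the experiments: the function name `detect_candidates` is hypothetical, and discarding zero-valued maxima is an assumption that relies on negative values having already been set to zero.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def detect_candidates(attention, size=6):
    """Return local maxima of the attention map, sorted by decreasing value.

    Works for 2D and 3D maps; a voxel is kept if it equals the maximum of its
    size x size (x size) neighbourhood and has a strictly positive value.
    """
    local_max = maximum_filter(attention, size=size, mode="constant", cval=-np.inf)
    peaks = (attention == local_max) & (attention > 0)
    coords = np.argwhere(peaks)
    scores = attention[peaks]
    order = np.argsort(-scores, kind="stable")
    return coords[order], scores[order]

# Toy 2D attention map with two nearby peaks and one isolated peak.
attention = np.zeros((20, 20))
attention[5, 5] = 1.0
attention[6, 6] = 0.8    # suppressed: lies in the neighbourhood of (5, 5)
attention[15, 12] = 0.5
coords, scores = detect_candidates(attention)
```

In this example the weaker peak next to the strongest one is suppressed, leaving two candidates ordered by their attention value.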
We used the Hungarian algorithm to create an optimal one-to-one match between all detected lesions or digits and their closest annotation in the ground truth. For the brain dataset, we counted a positive detection if a detection was within at most 6 voxels from the corresponding point in the ground truth. This corresponds to the maximum diameter of PVS in the axial view, as defined by Adams et al. [41]. For the MNIST datasets, we counted a positive detection if a detection fell inside the original 28 × 28 pixel MNIST image of the target digit.
Free-response receiver operating characteristic (FROC) curves [45] were computed to show the trade-off between sensitivity and the average number of false positives per brain region or per image (FPavg). For each network in our experiments, we report the area under the FROC curve (FAUC) computed from 0 to 2 FPavg for MNIST and from 0 to 15 FPavg for brain lesion detection. We also show the standard deviation of the FAUC, computed by bootstrapping the test set (drawing with replacement subsets of the size of the full test set, and computing statistics over these subsets).
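The matching step for the brain dataset can be sketched with `scipy.optimize.linear_sum_assignment`, which solves the same assignment problem as the Hungarian algorithm. The use of Euclidean distance as the cost, the function name `match_detections` and the toy coordinates are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(detections, annotations, max_dist=6.0):
    """One-to-one matching minimizing the total distance.

    Returns (true positives, false positives, false negatives), where a matched
    pair counts as a true positive only if its distance is at most max_dist.
    """
    if len(detections) == 0 or len(annotations) == 0:
        return 0, len(detections), len(annotations)
    diff = detections[:, None, :] - annotations[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    rows, cols = linear_sum_assignment(dist)
    tp = int(np.sum(dist[rows, cols] <= max_dist))
    return tp, len(detections) - tp, len(annotations) - tp

detections = np.array([[10.0, 10.0], [30.0, 30.0], [50.0, 50.0]])
annotations = np.array([[12.0, 11.0], [80.0, 80.0]])
tp, fp, fn = match_detections(detections, annotations)
```

Here only the first detection lies within 6 voxels of an annotation, so the example yields one true positive, two false positives and one missed annotation.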
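One way to compute the FAUC and its bootstrap standard deviation is sketched below. Several choices are assumptions of this example rather than details given in the text: the FROC curve is integrated as a step function, the last sensitivity is extended horizontally up to the FPavg limit, the area is not normalized, and the data structure `per_image` (scores, true-positive flags and number of annotated objects per image) is hypothetical.

```python
import numpy as np

def froc_auc(scores, is_tp, n_images, n_objects, max_fp_avg):
    """Area under sensitivity vs. average false positives, from 0 to max_fp_avg."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(is_tp, dtype=bool)[order]
    sens = np.concatenate([[0.0], np.cumsum(hits) / max(n_objects, 1)])
    fp_avg = np.concatenate([[0.0], np.cumsum(~hits) / n_images])
    # Extend the last sensitivity horizontally up to max_fp_avg if needed.
    sens = np.append(sens, sens[-1])
    fp_avg = np.append(fp_avg, max(fp_avg[-1], max_fp_avg))
    widths = np.diff(np.minimum(fp_avg, max_fp_avg))
    return float(np.sum(widths * sens[:-1]))

def bootstrap_fauc(per_image, max_fp_avg, n_boot, rng):
    """Mean and standard deviation of the FAUC over resampled test sets."""
    n = len(per_image)
    values = []
    for _ in range(n_boot):
        sample = [per_image[i] for i in rng.integers(0, n, size=n)]
        scores = np.concatenate([s for s, _, _ in sample])
        hits = np.concatenate([h for _, h, _ in sample])
        n_objects = sum(k for _, _, k in sample)
        values.append(froc_auc(scores, hits, n, n_objects, max_fp_avg))
    return float(np.mean(values)), float(np.std(values))

# Toy test set of three images: (candidate scores, true-positive flags, #objects).
per_image = [
    (np.array([0.9, 0.8, 0.7, 0.6]), np.array([True, False, True, False]), 2),
    (np.array([0.95, 0.3]), np.array([True, False]), 1),
    (np.array([0.5]), np.array([False]), 1),
]
fauc_first = froc_auc(*per_image[0][:2], n_images=1, n_objects=2, max_fp_avg=2.0)
mean_fauc, std_fauc = bootstrap_fauc(per_image, 2.0, n_boot=50,
                                     rng=np.random.default_rng(0))
```

For the first toy image, the sensitivity is 0.5 up to one false positive and 1.0 up to two, which gives an area of 1.5.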
In addition to the attention maps, the regression networks also predict the number of target objects in the image.For the detection of brain lesions, we use this predicted count rounded to an integer n to select the top-n candidates with highest scores, and compute the corresponding sensitivity and FPavg.
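Selecting the top-n candidates from the predicted count can be sketched as follows. The function name `select_top_n` and the toy values are illustrative; note that Python's `round` rounds exact halves to the nearest even integer, which is a detail of this sketch rather than something specified in the text.

```python
import numpy as np

def select_top_n(coords, scores, predicted_count):
    """Keep the n highest-scoring candidates, with n the rounded predicted count."""
    n = max(int(round(predicted_count)), 0)
    order = np.argsort(-scores, kind="stable")[:n]
    return coords[order], scores[order]

coords = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])
scores = np.array([0.2, 0.9, 0.5, 0.7])
top_coords, top_scores = select_top_n(coords, scores, predicted_count=2.6)
```

With a predicted count of 2.6, three candidates are kept, ordered by decreasing attention value.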
For the basal ganglia and the centrum semiovale, our dataset does not contain full 3D annotations, but only provides annotations for a single 2D slice per scan (see Section 3.2).For our evaluation we extract the corresponding 2D slice from the attention map prior to post-processing, and compute the metrics only for this slice.In case no lesion was annotated, we selected the middle slice of the attention map as a reasonable approximation of the rated slice.

Intra-rater variability of the lesion annotations
Intra-rater variability was measured in each region using a separate set of 40 MRI scans acquired and annotated with the same protocol. The rater annotated PVS twice in each scan, two weeks apart and in a different random order.
To compute the sensitivity and FPavg for the intra-rater variability, one of the two series of annotations has to be set as reference to define true positives, positives and false positives. We successively set the first and second series of annotations as reference, leading to two different results. All results for all regions are displayed next to the FROC curves in Figure 6.

MNIST datasets
The methods were evaluated on left-out test sets of 500 images, balanced as described in Section 3.1. Figure 5 compares the FAUC of regression and classification networks, for all MNIST digits, and for all methods. Overall, regression methods reach a higher detection performance than classification methods. For all digits, regression GP-Unet no residual reaches the best performance and regression GP-Unet the second best; both are consistently better than any other method. Regression Grad-CAM comes third, and regression Guided backpropagation fourth. Grad and Gated Attention come last. The ordering of the best classification methods differs from that of the best regression methods: Guided backpropagation comes first, Grad-CAM second and GP-Unet no residual third.
Figure 3 shows an example of the attention maps obtained for all weakly supervised methods optimized with regression objectives. As expected, Grad produces noisy attention maps with many high values, for both classification and regression objectives, and Guided backpropagation corrects these mistakes. Gradient methods seem to highlight multiple discriminating features of the digit 4 (e.g. its top branches), while CAM methods highlight a single larger, less detailed region. This may suggest that gradient methods could be more suited to weakly supervised segmentation, although judging from the figure, none of the methods seems capable of correctly segmenting digits.
Figure 4 compares attention maps of GP-Unet optimized with regression and classification. We noticed two interesting differences. First, when the target digit is present in the image, the regression attention map highlights each occurrence of the target digit with a similar intensity, while the classification attention map highlights more strongly the most obvious occurrences of the target digit. Second, when the target digit is not present in the image, contrary to the regression attention map, the classification attention map may highlight many false positives, possibly resulting in a significant drop in the detection performance.

Regression Guided backpropagation vs Grad.
Regression Guided backpropagation detects all digits more accurately than regression Grad. The same holds for classification Guided backpropagation versus classification Grad. However, regression Grad sometimes performs as well as (digits 4, 6, 7) or better than (digits 0, 9) classification Guided backpropagation, which underlines the added value of optimizing weakly supervised detection methods with regression objectives instead of classification objectives.

Detection of brain lesions
In the brain dataset, we compare the performance of the weakly supervised methods for the detection of enlarged perivascular spaces (PVS) by evaluating them on the left-out test set of 1000 scans and in the four regions. PVS appear as hyperintense areas in the T2-weighted images. In some regions, especially the midbrain and to some extent the basal ganglia, the image intensity can often be discriminative enough to be used as a crude attention map. We therefore include the raw image intensity as one of the attention maps in our comparison, and use the lesion count n predicted using the base architecture (see Section 2.2) to select the operating point on the FROC curve. This intensity method reaches a competitive performance in the midbrain and basal ganglia, but completely fails in the hippocampi and centrum semiovale because it highlights many false positives, corresponding to other hyperintense structures that look similar to enlarged perivascular spaces, but have a different biological meaning. Surrounding cere-

TABLE 1
FAUCs for the detection of brain lesions. To compute these FAUCs, we integrate the FROC curves (Figure 6) between 0 and 15 FPavg (Section 3.7). The best performance in each region is indicated in bold.
Methods compared: GP-Unet (this paper); GP-Unet no residual (Dubost et al. [16]); Intensities (Section 4.2); Gated Attention (Schlemper et al. [19]); Grad (Simonyan et al. [17]); Guided backprop (Springenberg et al. [18]); Grad-CAM (Selvaraju et al.).

Figure 6 shows FROC curves for all methods in the brain datasets. Table 1 shows the corresponding FAUCs. Tables 3 and 2 show the sensitivity and FPavg, respectively, measured at the operating point chosen for each method as described in Section 3.7.
Judging from Tables 1 to 3, the methods achieving the best results are GP-Unet, Grad-CAM and Guided backpropagation. Unlike the results on MNIST datasets, there is no method consistently better than others for all regions. Across methods, Guided backpropagation reaches the best FAUC in the midbrain and basal ganglia, GP-Unet reaches the best FAUC in the hippocampi, and finally, with a similar performance, GP-Unet and Grad-CAM achieve the best FAUC in the centrum semiovale.
In Figure 6, the sensitivity and FPavg between two series of annotations of the same scans from the same rater (green triangles) give an idea of the difficulty of detecting PVS in each region. In the midbrain and hippocampi, PVS are relatively easy to identify, as they are the only hyperintense lesions visible on T2 images. On the contrary, the detection of PVS in the basal ganglia and centrum semiovale is much more challenging, because in those regions other hyperintense structures that look similar to enlarged perivascular spaces could be present. In all regions, the performance of the automated methods comes close to the intra-rater agreement. This intra-rater agreement was, however, computed on a substantially smaller set. Interestingly, multiple methods highlight the same false positives. After visual checking by experts, many of these false positives appear to be PVS that were not annotated by the rater.

Comparison of CAM methods
Grad-CAM and GP-Unet reach similar FAUCs (Table 1) in the basal ganglia and centrum semiovale. However, GP-Unet outperforms Grad-CAM in the midbrain and by a large margin in the hippocampi. In these two regions, at the operating point, Grad-CAM suffers from more false positives than GP-Unet, while having a similar or worse sensitivity (Tables 2 and 3). This can also be observed in the attention maps of the hippocampi (Figure 8) and to some extent those of the midbrain (Figure 7). There, GP-Unet is less distracted by the surrounding cerebrospinal fluid than Grad-CAM or the methods emphasizing intensities (GP-Unet no residual, Intensities).
The motivation of Gated Attention is similar to that of GP-Unet: combining multiscale information in the computation of attention maps. In the MNIST datasets, Gated Attention and GP-Unet reach a similar detection performance when optimized with classification objectives but, contrary to GP-Unet and most other methods, Gated Attention rarely benefits from the regression objective. In the brain datasets, Gated Attention works better than the intensity baseline, Grad, and GP-Unet no residual, but performs significantly worse than Grad-CAM, Guided backpropagation, and GP-Unet. These results suggest that gate mechanisms may harm the detection performance for networks optimized with regression objectives, and that combining features from multiple scales via a simple concatenation of feature maps should be preferred.

DISCUSSION
We compared six weakly supervised detection methods in two datasets. In MNIST datasets, GP-Unet no residual [16] and GP-Unet (this article) perform significantly better than all other methods, probably because they can combine the information of different scales more effectively than other methods. For GP-Unet no residual, part of this performance difference could also be explained by the larger number of parameters and larger receptive field (Section 2.2). However, for GP-Unet, the number of parameters is comparable to that of the other methods. In the brain dataset, the best methods are Guided backpropagation [18] with 74.1 average FAUC over regions, and GP-Unet with 72.1 average FAUC. Depending on the brain region, either Guided backpropagation or GP-Unet performed best.
Due to the special properties of the PVS detection problem in the brain datasets, intensity thresholding provides a simple approach to solving the same problem in some regions. Although intensity thresholding yields the lowest performance in hippocampi, basal ganglia, and centrum semiovale, its results are still reasonable in basal ganglia and it achieves the second best FAUC in the midbrain, where almost all hyperintensities in the segmented region correspond to PVS. In the regions where intensity thresholding achieved reasonable results (midbrain and basal ganglia), Guided backpropagation performed best. In the regions where the intensity method failed (hippocampi and centrum semiovale), GP-Unet reached the best performance (similar to that of Grad-CAM in the centrum semiovale). More generally, gradient methods seem to work best when the target objects are also the most salient objects, while CAM methods are a better choice when saliency alone is not sufficient. This observation can also be extended to the MNIST datasets, where saliency is not discriminative, and regression CAM methods (Gated Attention excluded) outperform regression gradient methods.
Recently, researchers have investigated the effect of saliency on the computation of attention maps. Adebayo et al. [46] showed that, for Guided backpropagation, classification networks trained with random labels or networks with randomized weights produced attention maps similar to those of networks trained with the correct labels, hinting that attention map methods may focus more on salient objects in the image than on the target object. In these experiments, attention maps computed with Grad and Grad-CAM obtained better results. Adebayo et al. warn against evaluating attention maps by visual appeal alone, and advocate more rigorous forms of evaluation. This is in line with the current article, in which we quantify the detection performance of attention maps in a large real-world dataset.
In our preliminary work on PVS detection in the basal ganglia using GP-Unet no residual [16], we obtained slightly different results from what is presented in the current work. This reflects differences in the test data set, the annotation procedure, the method, and the more sophisticated postprocessing, in which non-maximum suppression clears the noise in the attention maps. In addition, the current work also includes scans without annotations (because the rater found no lesion), where there could have been errors in finding the slice evaluated by the rater.
Overall, results showed that weakly supervised methods can detect enlarged perivascular spaces almost as well as expert raters. The performance of the best detection methods was close to the intra-rater agreement computed on a different set. Finally, further visual inspection also revealed that many of the false positives correspond to PVS that were not annotated by the human rater. Especially in scans with a large number of PVS, some of the smaller or less obvious PVS were often not annotated.
The variety of challenges present in the brain datasets makes them well suited to exploring the evaluation of weakly supervised detection methods. Many observations and results could likely be generalized, for instance to the detection of other types of multiple objects that are small relative to the image resolution, such as cars in high-resolution satellite images [47], [48], [49], or humans in crowds [50], [51].
The work presented in this article implies that pixel-level annotations may not always be needed to train accurate models for detection problems. This is especially relevant in medical imaging, where annotation requires expert knowledge and high-quality annotations are therefore difficult to obtain. Weakly supervised methods enable learning from large databases with less annotation effort, and could also help to reduce the dependence on annotator biases. The global label may be more reliable, because for some abnormalities raters can agree well on the presence or global burden of the abnormalities but poorly on their boundaries or spatial distribution.

CONCLUSION
We proposed a new weakly supervised detection method, GP-Unet, that uses an encoder-decoder architecture optimized only with global labels. With the help of higher scale feature information from skip-connections coming from the encoder, the decoder part upsamples feature maps and enables the computation of attention maps at the resolution of the input image, which helps the detection of small objects. We also showed the advantage of using regression objectives over classification objectives for the detection of multiple objects. We compared the proposed method to state-of-the-art methods on the detection of digits in MNIST-based datasets and in a real-life example on the detection of enlarged perivascular spaces (a type of brain lesion) from 3D brain MRI. The best weakly supervised detection methods were Guided backpropagation [18] and the proposed GP-Unet. In our experiments, we noticed that methods based on the gradient of the output of the network, such as Guided backpropagation, seemed to work best in datasets where the target objects are also the most salient objects. In other datasets, where there can be other objects as salient or even more salient than the target object, methods using class activation maps, such as GP-Unet, seemed to work best.

Fig. 2. Architectures. A is GP-Unet's architecture. B is the Gated Attention architecture. C is the base architecture used for Grad, Guided backpropagation, and Grad-CAM. D is the GP-Unet no residual architecture. GAP stands for global average pooling layer, FC for fully connected layer, and A for attention gate. All architectures are detailed in Section 2.2.

Fig. 3. Examples of attention maps of the different weakly supervised detection methods for the detection of digit 4. Top-left: MNIST image. All methods optimized with regression objectives.

Fig. 5. FAUCs (Section 3.7) on the MNIST dataset for all methods. Each subplot corresponds to the detection of a different digit. Results for regression networks are displayed in light blue (left), and results for classification networks are displayed in indigo (right). A: Grad, B: Guided backpropagation, C: Grad-CAM, D: GP-Unet no residual, E: GP-Unet, F: Gated Attention. FAUCs are displayed with standard deviations computed by bootstrapping the test set.

Fig. 6. FROC curves of enlarged perivascular spaces detection in the brain MRI in four different regions. The average number of false positives per scan is displayed on the x-axis, and the sensitivity on the y-axis. Axes have been rescaled for better visibility. The green triangles indicate intra-rater agreement (on a smaller set) as described in Section 3.8.

Fig. 7. Attention maps in the midbrain. The top left image shows a slice of an example image of the midbrain after preprocessing, with PVS indicated with red circles. The other images correspond to attention maps computed for that same slice. Red values correspond to high values in the attention maps.
• F. Dubost is with the Biomedical Imaging Group Rotterdam, Departments of Radiology and Medical Informatics, Erasmus Medical Center, 3015 GE Rotterdam, The Netherlands. E-mail: floriandubost1@gmail.com
• H. Adams, P. Yilmaz and M. Vernooij are with the Departments of Radiology and Epidemiology, Erasmus Medical Center, 3015 GE Rotterdam, The Netherlands.
• A. Ikram is with the Departments of Radiology, Epidemiology and Neurology, Erasmus Medical Center, 3015 GE Rotterdam, The Netherlands.
• G. Bortsova is with the Biomedical Imaging Group Rotterdam, Departments of Radiology and Medical Informatics, Erasmus Medical Center, 3015 GE Rotterdam, The Netherlands.
• W. Niessen is with the Biomedical Imaging Group Rotterdam, Departments of Radiology and Medical Informatics, Erasmus Medical Center, 3015 GE Rotterdam, The Netherlands, and also with the Department of Imaging Physics, Faculty of Applied Science, TU Delft, The Netherlands.
• M. de Bruijne is with the Biomedical Imaging Group Rotterdam, Departments of Radiology and Medical Informatics, Erasmus Medical Center, 3015 GE Rotterdam, The Netherlands, and also with the Machine Learning Section, Department of Computer Science, University of Copenhagen, DK-2110 Copenhagen, Denmark.

Fig. 1. Principle of CAM methods for regression. GP stands for global pooling. f_k correspond to the feature maps of the last convolutional layer. Disks correspond to scalar values. w_k are the weights of the fully connected layer. Left (TRAIN, regression): the architecture of the network during training. Right (TEST, detection): the architecture at inference time, where the global pooling is removed.
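The principle of Fig. 1 can be sketched numerically as follows. The sketch assumes global average pooling and a fully connected layer without bias, neither of which is stated in the caption; the function names and array shapes are hypothetical. Under these assumptions, the attention map is the weighted sum of the feature maps, and its mean equals the count predicted during training.

```python
import numpy as np


def regression_output(feature_maps, weights):
    """Training-time output: pool each feature map f_k, then apply w_k.

    feature_maps has shape (K, ...) with K feature maps; weights has shape (K,).
    """
    k = feature_maps.shape[0]
    pooled = feature_maps.reshape(k, -1).mean(axis=1)  # global average pooling
    return float(weights @ pooled)


def attention_map(feature_maps, weights):
    """Inference-time map: same weights, but the global pooling is removed."""
    # Weighted sum over the K feature maps, keeping the spatial dimensions.
    return np.tensordot(weights, feature_maps, axes=1)


rng = np.random.default_rng(0)
f = rng.random((4, 8, 8))          # four hypothetical 8 x 8 feature maps
w = rng.normal(size=4)             # hypothetical fully connected weights
cam = attention_map(f, w)
# With average pooling and no bias, both paths agree up to rounding.
assert np.isclose(cam.mean(), regression_output(f, w))
```

The same code applies to 3D feature maps of shape (K, D, H, W), since only the first axis is contracted.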

TABLE 2
Average number of false positives per scan in the brain datasets. Best performances are indicated in bold.

TABLE 3
Sensitivity in the brain datasets. Best performances are indicated in bold.