
1 Introduction

Computer vision research has recently made tremendous progress. Many challenging vision tasks can now be solved with high accuracy, provided that sufficient annotated data is available for training. Unfortunately, collecting large labeled datasets is time consuming and typically requires substantial financial investment. The creation of training data has therefore become a bottleneck for the further development of computer vision methods. Unlabeled visual data, however, can be collected in large amounts relatively quickly and cheaply. A promising direction in computer vision research is therefore to develop methods that can learn from unlabeled or partially labeled data.

In this paper we focus on the task of semantic image segmentation. Image segmentation is a prominent example of an important vision task, for which creating annotations is especially costly: as reported in [4, 29], manually producing segmentation masks requires several worker-minutes per image. Therefore, a large body of previous research studies how to train segmentation models from weaker forms of annotation.

A particularly appealing setting is to learn image segmentation models using training sets with only per-image labels, as this form of weak supervision can be collected very efficiently. However, there is currently still a large performance gap between models trained from per-image labels and models trained from full segmentation masks. In this paper we demonstrate that this gap can be substantially reduced compared to previous state-of-the-art techniques.

We propose a new composite loss function for training convolutional neural networks for the task of weakly-supervised image segmentation. Our approach relies on the following three insights:

  • Image classification neural networks, such as AlexNet [19] or VGG [33], can be used to generate reliable object localization cues (seeds), but fail to predict the exact spatial extent of the objects. We incorporate this aspect by using a seeding loss that encourages a segmentation network to match localization cues but that is agnostic about the rest of the image.

  • To train a segmentation network from per-image annotation, a global pooling layer can be used that aggregates segmentation masks into image-level label scores. The choice of this layer has a large impact on the quality of segmentations. For example, max-pooling tends to underestimate the size of objects while average-pooling tends to overestimate it [26]. We propose a global weighted rank pooling, leveraged by an expansion loss, to expand the object seeds to regions of a reasonable size. It generalizes max-pooling and average-pooling and outperforms them in our empirical study.

  • Networks trained from image-level labels rarely capture the precise boundaries of objects in an image. Postprocessing by fully-connected conditional random fields (CRF) at test time is often insufficient to overcome this effect, because once the networks have been trained they tend to be confident even about misclassified regions. We propose a new constrain-to-boundary loss that alleviates the problem of imprecise boundaries already at training time. It strives to constrain predicted segmentation masks to respect low-level image information, in particular object boundaries.

We name our approach SEC, as it is based on three principles: Seed, Expand and Constrain. We formally define and discuss the individual components of the SEC loss function in Sect. 3. In Sect. 4 we experimentally evaluate it on the PASCAL VOC 2012 image segmentation benchmark, showing that it substantially outperforms the previous state-of-the-art techniques under the same experimental settings. We also provide further insight by discussing and evaluating the effect of each of our contributions separately through additional experiments.

2 Related Work

Semantic image segmentation, i.e. assigning a semantic class label to each pixel of an image, is a topic of relatively recent interest in computer vision research, as it required the availability of modern machine learning techniques, such as discriminative classifiers [5, 31] or probabilistic graphical models [21, 28]. As the creation of fully annotated training data poses a major bottleneck to the further improvement of these systems, weakly supervised training methods were soon proposed in order to save annotation effort. In particular, competitive methods were developed that only require partial segmentations [11, 37] or object bounding boxes [8, 20, 52] as training data.

A remaining challenge is, however, to learn segmentation models from just image-level labels [35, 36]. Existing approaches fall into three broad categories. Graph-based models infer labels for segments or superpixels based on their similarity within or between images [27, 43, 46–48]. Variants of multiple instance learning [1] train with a per-image loss function, while internally maintaining a spatial representation of the image that can be used to produce segmentation masks [38–40]. Methods in the tradition of self-training [30] train a fully-supervised model but create the necessary pixel-level annotation using the model itself in an EM-like procedure [44, 45, 49]. Our SEC approach contains aspects of the latter two approaches, as it makes use of a per-image loss as well as per-pixel loss terms.

In terms of segmentation quality, currently only methods based on deep convolutional networks [19, 33] are strong enough to tackle segmentation datasets of difficulty similar to what fully-supervised methods can handle, such as the PASCAL VOC 2012 [9], which we make use of in this work. In particular, MIL-FCN [25], MIL-ILP [26] and the approaches of [4, 18] leverage deep networks in a multiple instance learning setting, differing mainly in their pooling strategies, i.e. how they convert their internal spatial representation to per-image labels. EM-Adapt [23] and CCNN [24] rely on the self-training framework and differ in how they enforce the consistency between the per-image annotation and the predicted segmentation masks. SN_B [41] adds additional steps for creating and combining multiple object proposals. As far as possible, we provide an experimental comparison to these methods in Sect. 4.

3 Weakly Supervised Segmentation from Image-Level Labels

In this section we present a technical description of our approach. We denote the space of images by \(\mathcal {X}\). For any image \(X \in \mathcal {X}\), a segmentation mask Y is a collection, \((y_1, \dots , y_n)\), of semantic labels at n spatial locations. The semantic labels belong to a set \(\mathcal {C} = \mathcal {C}^\prime \cup \{c^{\mathrm {bg}}\}\) of size k, where \(\mathcal {C}^\prime \) is a set of all foreground labels and \(c^{\mathrm {bg}}\) is a background label. We assume that the training data, \(\mathcal {D} = \{(X_i, T_i)\}_{i=1}^{N}\), consists of N images, \(X_i \in \mathcal {X}\), where each image is weakly annotated by a set, \(T_i \subset \mathcal {C}^\prime \), of foreground labels that occur in the image. Our goal is to train a deep convolutional neural network \(f(X; \theta )\), parameterized by \(\theta \), that models the conditional probability of observing any label \(c \in \mathcal {C}\) at any location \(u \in \{1, 2, \dots , n\}\), i.e. \(f_{u,c}(X; \theta ) = p(y_u = c | X)\). For brevity we will often omit the parameters \(\theta \) in our notation and write \(f(X; \theta )\) simply as f(X).

Fig. 1. A schematic illustration of SEC that is based on minimizing a composite loss function consisting of three terms: seeding loss, expansion loss and constrain-to-boundary loss. See Sect. 3 for details.

3.1 The SEC Loss for Weakly Supervised Image Segmentation

Our approach for learning the parameters, \(\theta \), of the segmentation neural network relies on minimizing a loss function that has three terms. The first term, \(L_\text {seed}\), provides localization hints to the network, the second term, \(L_\text {expand}\), penalizes the network for predicting segmentation masks with too small or wrong objects, and the third term, \(L_\text {constrain}\), encourages segmentations that respect the spatial and color structure of the images. Overall, we propose to solve the following optimization problem for parameter learning:

$$\begin{aligned} \min \limits _{\theta } \!\!\! \sum \limits _{(X, T) \in \mathcal {D}} \!\!\!\!\! \left[ L_\text {seed}(f(X; \theta ), T) + L_\text {expand}(f(X; \theta ), T) + L_\text {constrain}(X, f(X; \theta )) \right] . \end{aligned}$$
(1)

In the rest of this section we explain each loss term in detail. A schematic overview of the setup can be found in Fig. 1.

Seeding Loss with Localization Cues. Image-level labels do not explicitly provide any information about the position of semantic objects in an image. Nevertheless, as noted in many recent research papers [3, 22, 32, 50], deep image classification networks trained just from image-level labels can be successfully employed to retrieve cues on object localization. We call this procedure weak localization and illustrate it in Fig. 2.

Unfortunately, localization cues typically are not precise enough to be used as full and accurate segmentation masks. However, these cues can be very useful to guide the weakly-supervised segmentation network. We propose to use a seeding loss to encourage predictions of the neural network to match only “landmarks” given by the weak localization procedure while ignoring the rest of the image. Suppose that \(S_c\) is a set of locations that are labeled with class c by the weak localization procedure. Then, the seeding loss \(L_\text {seed}\) has the following form:

$$\begin{aligned} L_\text {seed}(f(X), T, S_c) = -\dfrac{1}{\sum \limits _{c \in T} |S_c|} \sum \limits _{c \in T} \sum \limits _{u \in S_c} \log f_{u, c}(X). \end{aligned}$$
(2)

Note that computing \(L_\text {seed}\) requires the weak localization sets, \(S_c\), for which many existing techniques from the literature can be used, essentially, as black boxes. In this work, we rely on [50] for weakly localizing foreground classes. However, this method does not provide a direct way to select confident background regions, so we use the gradient-based saliency detection method from [32] for this purpose. We provide more details on the weak localization procedure in Sect. 4.
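To make the definition concrete, a minimal NumPy sketch of Eq. (2) is given below. It assumes the network output is available as an array probs of shape (n, k) with per-location class probabilities, and that the localization cues are given as a dictionary mapping each class \(c \in T\) to its set of seed locations \(S_c\); these names and data structures are illustrative choices, not part of the original implementation.

```python
import numpy as np

def seeding_loss(probs, seeds):
    """Eq. (2): average negative log-probability over all seed locations.

    probs : array of shape (n, k), probs[u, c] = f_{u,c}(X)
    seeds : dict mapping class index c (for c in T) to a list of seed locations S_c
    """
    total, count = 0.0, 0
    for c, locations in seeds.items():
        for u in locations:
            total += -np.log(probs[u, c] + 1e-12)  # small constant for numerical stability
            count += 1
    return total / max(count, 1)
```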

Fig. 2. A schematic illustration of the weak localization procedure.

Expansion Loss with Global Weighted Rank Pooling. To measure whether a segmentation mask is consistent with the image-level labels, one can aggregate segmentation scores into classification scores and apply a standard loss function for multi-label image classification. In the context of weakly-supervised segmentation/detection, various techniques have been used by researchers to aggregate score maps into classification scores. The most prominent ones are global max-pooling (GMP) [22], which assigns any class c in any image X a score of \(\max \limits _{u \in \{1, \dots , n\}} f_{u,c}(X)\), and global average-pooling (GAP) [50], which assigns it a score of \(\frac{1}{n}\!\sum \limits _{u = 1}^{n} f_{u,c}(X)\).

Both ways of aggregation have been successfully used in practice. However, each has its own drawbacks. For classes that are present in an image, GMP only encourages the response for a single location to be high, while GAP encourages all responses to be high. Therefore, GMP results in a segmentation network that often underestimates the sizes of objects, while a network trained using GAP, in contrast, often overestimates them. Our experiments in Sect. 4 support this claim empirically.

In order to overcome these drawbacks we propose global weighted rank pooling (GWRP), a new aggregation technique that can be seen as a generalization of GMP and GAP. GWRP computes a weighted average score for each class, where weights are higher for more promising locations. This way it encourages objects to occupy a certain fraction of an image, but, unlike GAP, is less prone to overestimating object sizes.

Formally, let an index set \(I^c = \{i_{1}, \dots , i_{n}\}\) define the descending order of prediction scores for any class \(c \in \mathcal {C}\), i.e. \(f_{i_{1},c}(X) \ge f_{i_{2},c}(X) \ge \dots \ge f_{i_{n},c}(X)\), and let \(0 < d_c \le 1\) be a decay parameter for class c. Then we define the GWRP classification score, \(G_c(f(X); d_c)\), for an image X as follows:

$$\begin{aligned} G_c(f(X); d_c) = \dfrac{1}{Z(d_c)} \sum \limits _{j=1}^n (d_c)^{j-1} f_{i_j,c}(X), \; \text {where} \; Z(d_c) = \sum \limits _{j=1}^n (d_c)^{j-1}. \end{aligned}$$
(3)

Note that for \(d_c = 0\) GWRP turns into GMP (adopting the convention that \(0^0 = 1\)), and for \(d_c=1\) it is identical to GAP. Therefore, GWRP generalizes both approaches, and the decay parameter can be used to interpolate between the behavior of the two extremes.
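The following NumPy sketch illustrates Eq. (3) and this interpolation property on a toy score vector; the numbers are purely illustrative.

```python
import numpy as np

def gwrp(scores, d):
    """Eq. (3): global weighted rank pooling of the per-location scores for one class.

    scores : 1-D array holding f_{u,c}(X) for all n locations
    d      : decay parameter (d = 0 recovers GMP, d = 1 recovers GAP)
    """
    s = np.sort(scores)[::-1]            # scores in descending order i_1, ..., i_n
    w = d ** np.arange(len(s))           # weights (d_c)^(j-1), with the convention 0**0 = 1
    return np.dot(w, s) / w.sum()

scores = np.array([0.9, 0.6, 0.3, 0.1])  # toy per-location scores
print(gwrp(scores, 0.0))   # 0.9   -> identical to max-pooling
print(gwrp(scores, 1.0))   # 0.475 -> identical to average-pooling
print(gwrp(scores, 0.5))   # in between, biased towards the high-scoring locations
```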

In principle, the decay parameter could be set individually for each class and each image. However, this would need prior knowledge about how large objects of each class typically are, which is not available in the weakly supervised setting. Therefore, we only distinguish between three groups: for object classes that occur in an image we use a decay parameter \(d_+\), for object classes that do not occur we use \(d_-\), and for background we use \(d_\text {bg}\). We will discuss how to choose their values in Sect. 4.

In summary, the expansion loss term is

$$\begin{aligned} L_\text {expand}(f(X), T) =&\! -\dfrac{1}{|T|} \sum \limits _{c \in T} \log G_c(f(X);\! d_+) \\&\! -\dfrac{1}{|\mathcal {C}^\prime \backslash T|} \! \sum \limits _{c \in \mathcal {C}^\prime \backslash T} \!\! \log (1 - G_c(f(X); d_-)) -\log G_{c^{\mathrm {bg}}} (f(X);\! d_\mathrm {bg}). \nonumber \end{aligned}$$
(4)
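A sketch of Eq. (4) is shown below, reusing the gwrp helper from the previous sketch (repeated here so the block is self-contained). It assumes probs has shape (n, k) with class index 0 reserved for the background and indices 1 to \(|\mathcal {C}^\prime |\) for the foreground classes; this indexing convention is ours, chosen only for illustration.

```python
import numpy as np

def gwrp(scores, d):
    """GWRP of Eq. (3), as in the earlier sketch."""
    s = np.sort(scores)[::-1]
    w = d ** np.arange(len(s))
    return np.dot(w, s) / w.sum()

def expansion_loss(probs, present, num_fg, d_plus=0.996, d_minus=0.0, d_bg=0.999):
    """Eq. (4): image-level classification loss on GWRP-aggregated scores.

    probs   : array of shape (n, k); column 0 is background, columns 1..num_fg are foreground
    present : set of foreground class indices T that occur in the image
    """
    eps = 1e-12
    absent = [c for c in range(1, num_fg + 1) if c not in present]
    loss = 0.0
    if present:  # classes in the image: encourage a sufficiently large aggregated score
        loss += -np.mean([np.log(gwrp(probs[:, c], d_plus) + eps) for c in present])
    if absent:   # classes not in the image: penalize any response (GMP, since d_- = 0)
        loss += -np.mean([np.log(1.0 - gwrp(probs[:, c], d_minus) + eps) for c in absent])
    loss += -np.log(gwrp(probs[:, 0], d_bg) + eps)  # background term with decay d_bg
    return loss
```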

Constrain-to-boundary Loss. The high level idea of the constrain-to-boundary loss is to penalize the neural network for producing segmentations that are discontinuous with respect to spatial and color information in the input image. Thereby, it encourages the network to learn to produce segmentation masks that match up with object boundaries.

Specifically, we construct a fully-connected CRF, Q(X, f(X)), as in [17], with unary potentials given by the logarithm of the probability scores predicted by the segmentation network, and pairwise potentials of fixed parametric form that depend only on the image pixels. We downscale the image X so that it matches the resolution of the segmentation mask produced by the network. More details about the choice of the CRF parameters are given in Sect. 4. We then define the constrain-to-boundary loss as the mean KL-divergence between the outputs of the network and the outputs of the CRF, i.e.:

$$\begin{aligned} L_\text {constrain}(X, f(X))&=\dfrac{1}{n} \sum \limits _{u=1}^{n} \sum \limits _{c \in \mathcal {C}} Q_{u,c}(X, f(X)) \log \frac{Q_{u,c}(X, f(X))}{f_{u,c}(X)}. \end{aligned}$$
(5)

This construction achieves the desired effect, since it encourages the network output to coincide with the CRF output, which itself is known to produce segmentations that respect image boundaries. An illustration of this effect can be seen in Fig. 1.
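Given the CRF marginals Q (computed by the fully-connected CRF of [17]), Eq. (5) is simply a mean KL-divergence; a minimal NumPy sketch:

```python
import numpy as np

def constrain_to_boundary_loss(probs, crf_probs):
    """Eq. (5): mean KL-divergence between the CRF output Q and the network output f(X).

    probs, crf_probs : arrays of shape (n, k) whose rows sum to one
    """
    eps = 1e-12
    kl = crf_probs * (np.log(crf_probs + eps) - np.log(probs + eps))
    return kl.sum(axis=1).mean()   # sum over classes, average over the n locations
```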

3.2 Training

The proposed network can be trained in an end-to-end way using back-propagation, provided that the individual gradients of all layers are available. For computing gradients of the fully-connected CRF we employ the procedure from [34], which was successfully used in the context of semantic image segmentation. Figure 1 illustrates the flow of gradients for the backpropagation procedure with gray arrows.
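For a single training image, the composite objective of Eq. (1) is then just the sum of the three terms. The sketch below is a schematic assembly of the helper functions from the previous sections; in an actual end-to-end implementation each term, including the CRF inference, must be expressed in a framework that provides the corresponding gradients, as discussed above.

```python
def sec_loss(probs, crf_probs, seeds, present, num_fg):
    """Composite SEC objective of Eq. (1) for one image (schematic assembly of the
    loss sketches above; gradient computation is left to the training framework)."""
    return (seeding_loss(probs, seeds)
            + expansion_loss(probs, present, num_fg)
            + constrain_to_boundary_loss(probs, crf_probs))
```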

4 Experiments

In this section we validate our proposed loss function experimentally, including a detailed study of the effects of its different terms.

4.1 Experimental Setup

Dataset and Evaluation Metric. We evaluate our method on the PASCAL VOC 2012 image segmentation benchmark, which has 21 semantic classes, including background [9]. The dataset images are split into three parts: training (train, 1464 images), validation (val, 1449 images) and testing (test, 1456 images). Following common practice, we augment the training part with additional images from [10]. The resulting trainaug set has 10,582 weakly annotated images that we use to train our models. We compare our approach with other approaches on both the val and test parts. For the val part, ground truth segmentation masks are available, so we can evaluate the results of different experiments. We therefore use this data also to provide a detailed study of the influence of the different components of our approach. The ground truth segmentation masks for the test part are not publicly available, so we use the official PASCAL VOC evaluation server to obtain quantitative results. As evaluation measure we use the standard PASCAL VOC 2012 segmentation metric: mean intersection-over-union (mIoU).
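For reference, a small NumPy sketch of the mIoU metric as it is commonly computed for PASCAL VOC, accumulating per-class intersections and unions over the whole dataset; the ignore label 255 for unlabeled pixels is the usual VOC convention, assumed here.

```python
import numpy as np

def mean_iou(predictions, ground_truths, num_classes=21, ignore_label=255):
    """Mean intersection-over-union over all classes, accumulated over a dataset.

    predictions, ground_truths : iterables of integer label maps of equal shape
    """
    inter = np.zeros(num_classes)
    union = np.zeros(num_classes)
    for pred, gt in zip(predictions, ground_truths):
        valid = gt != ignore_label                       # pixels that count for evaluation
        for c in range(num_classes):
            p, g = (pred == c) & valid, (gt == c) & valid
            inter[c] += np.logical_and(p, g).sum()
            union[c] += np.logical_or(p, g).sum()
    return np.mean(inter / np.maximum(union, 1))
```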

Segmentation Network. As a particular choice for the segmentation architecture, in this paper we use DeepLab-CRF-LargeFOV from [6], which is a slightly modified version of the 16-layer VGG network [33]. The network takes inputs of size 321\(\,\times \,\)321 and produces segmentation masks of size 41\(\,\times \,\)41; see [6] for more details on the architecture. We initialize the weights of the last (prediction) layer randomly from a normal distribution with mean 0 and variance 0.01. All other convolutional layers are initialized from the publicly available VGG model [33]. Note that, in principle, our loss function can be combined with any deep convolutional neural network.

Localization Networks. The localization networks for the foreground classes and the background class are also derived from the standard VGG architecture. In order to improve the localization performance, we finetune these networks for solving a multilabel classification problem on the trainaug data. Due to space limitations we provide exact details on these networks and optimization parameters in the technical report [16].

Note that, in order to reduce the computational effort and memory consumption required for training SEC, it is possible to precompute the localization cues. If precomputed cues are available, SEC imposes no additional overhead for evaluating and storing the localization networks at training time.

Optimization. For training the network we use batched stochastic gradient descent (SGD) with parameters that were used successfully in [6]. We run SGD for 8000 iterations; the batch size is 15 (reduced from 30 to allow simultaneous training of two networks), the dropout rate is 0.5 and the weight decay parameter is 0.0005. The initial learning rate is 0.001 and it is decreased by a factor of 10 every 2000 iterations. Overall, training on a GeForce TITAN-X GPU takes 7–8 h, which is comparable to the training times of other models, reported, e.g., in [23, 24].
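The learning-rate schedule described above is a simple step schedule; as a sketch (hyperparameter values as quoted in the text):

```python
def learning_rate(iteration, base_lr=0.001, gamma=0.1, step_size=2000):
    """Step schedule: the learning rate is divided by 10 every 2000 iterations."""
    return base_lr * gamma ** (iteration // step_size)

# e.g. learning_rate(0) == 0.001, learning_rate(2000) == 0.0001, learning_rate(7999) == 1e-06
```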

Decay Parameters. The GWRP aggregation requires specifying the decay parameters, \(d_-, d_+\) and \(d_{\mathrm {bg}}\), that control the weights for aggregating the scores produced by the network. Inspired by the previous research [23, 24] we do so using the following rules-of-thumb that express prior beliefs about natural images:

  • for semantic classes that are not present in the image we want to predict as few pixels as possible. Therefore, we set \(d_-=0\), which corresponds to GMP.

  • for semantic classes that are present in the image we suggest that the top 10 % scores represent 50 % of the overall aggregated score. For our 41\(\,\times \,\)41 masks this roughly corresponds to \(d_+=0.996\).

  • for the background we suggest that the top 30 % scores represent 50 % of the overall aggregated score, resulting in \(d_{\mathrm {bg}} = 0.999\). A short numeric check of these rules is given after this list.
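These rules of thumb can be checked numerically: for an n-location mask, the fraction of the total GWRP weight carried by the top-ranked \(q \cdot n\) locations is \(\sum _{j=1}^{\lfloor qn \rfloor } d^{j-1} / \sum _{j=1}^{n} d^{j-1}\). A small sketch of this check (our own verification, not part of the training code):

```python
import numpy as np

def top_fraction_weight(d, q, n=41 * 41):
    """Fraction of the total GWRP weight carried by the top q*n ranked locations."""
    weights = d ** np.arange(n)
    return weights[: int(q * n)].sum() / weights.sum()

print(top_fraction_weight(0.996, 0.10))  # ~0.49: top 10% of scores carry about half the weight
print(top_fraction_weight(0.999, 0.30))  # ~0.49: top 30% of scores carry about half the weight
```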

Fig. 3. A schematic illustration of our approach at test time.

Fully-connected CRF at Training Time. In order to encourage the segmentation network to respect the boundaries of objects already at training time, we use a fully-connected CRF [17]. As parameters for the pairwise interactions, we use the default values from the authors’ public implementation, except that we multiply all spatial distance terms by 12 to reflect the fact that we downscaled the original image in order to match the size of the predicted segmentation mask.

Inference at Test Time. Our segmentation neural network is trained to produce probability scores for all classes and locations, but the spatial resolution of a predicted segmentation mask is lower than the original image. Thus, we upscale the predicted segmentation mask to match the size of the input image, and then apply a fully-connected CRF [17] to refine the segmentation. This is a common practice, which was previously employed, e.g., in [6, 23, 24]. Figure 3 shows a schematic illustration of our inference procedure at test time.
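A sketch of this test-time procedure is shown below. The upscaling uses bilinear interpolation, and the CRF refinement is illustrated with the pydensecrf package as one possible implementation of [17]; the pairwise parameters shown are generic defaults, not necessarily the exact values used in our experiments.

```python
import numpy as np
import pydensecrf.densecrf as dcrf                  # assumed CRF implementation of [17]
from pydensecrf.utils import unary_from_softmax
from scipy.ndimage import zoom

def predict_mask(image, coarse_probs, n_iter=10):
    """Test-time inference: upscale the coarse score map, then refine with a dense CRF.

    image        : uint8 array of shape (H, W, 3)
    coarse_probs : array of shape (k, h, w), e.g. the 21 x 41 x 41 network output
    """
    H, W = image.shape[:2]
    k, h, w = coarse_probs.shape
    # bilinear upscaling of the probability maps to the input resolution
    probs = zoom(coarse_probs, (1, H / h, W / w), order=1)
    probs = np.clip(probs, 1e-8, 1.0)
    # fully-connected CRF refinement (pairwise parameters are illustrative defaults)
    crf = dcrf.DenseCRF2D(W, H, k)
    crf.setUnaryEnergy(unary_from_softmax(probs))
    crf.addPairwiseGaussian(sxy=3, compat=3)
    crf.addPairwiseBilateral(sxy=80, srgb=13, rgbim=np.ascontiguousarray(image), compat=10)
    q = np.array(crf.inference(n_iter))
    return q.reshape(k, H, W).argmax(axis=0)         # final per-pixel labels
```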

Reproducibility. In our experiments we rely on the caffe deep learning framework [13] in combination with a python implementation of the SEC loss. The code and pretrained models are publicly available.

4.2 Results

Numeric Results. Table 1 compares the performance of our weakly supervised approach with previous approaches that are trained in the same setup, i.e. using only images from PASCAL VOC 2012 and only image-level labels. It shows that SEC substantially outperforms the previous techniques. On the test data, where the evaluation is performed by an independent third party, the PASCAL VOC evaluation server, it achieves a 13.5 % higher mean intersection-over-union score than the previous state of the art, with new best scores on 20 out of 21 semantic classes. On the validation data, for which researchers can compute scores themselves, SEC improves over the state of the art by 14.1 % and achieves new best scores on 19 out of the 21 classes.

Table 1. Results on PASCAL VOC 2012 (mIoU in %) for weakly-supervised semantic segmentation with only per-image labels
Table 2. Summary results (mIoU %) for other methods on PASCAL VOC 2012. Note: the values in this table are not directly comparable to Table 1, as they were obtained under different experimental conditions

Results of other weakly-supervised methods on PASCAL VOC and the fully-supervised variant of DeepLab are summarized in Table 2. We provide these results for reference but emphasize that they should not simply be compared to Table 1, because the underlying methods were trained on different (and larger) training sets or were given additional forms of weak supervision, e.g. user clicks. Some entries need further explanation in this regard: [23] reports results for the EM-Adapt model when trained with weak annotation for multiple image crops. The same model was reimplemented and trained with only per-image supervision in [24], so these are the values we report in Table 1. The results reported for SN_B [41] and the seg variant of MIL+ILP+SP [26] are incomparable to others because they were obtained with the help of MCG region proposals [2] that were trained in a fully supervised way on PASCAL VOC data. Similarly, MIL+ILP+SP-bb makes use of bounding box proposals generated by the BING method [7] that was trained using PASCAL VOC bounding box annotation.

Note that we do include the sppxl variant of MIL+ILP+SP in Table 1. While it is trained on roughly 760,000 images of the ImageNet dataset, we do not consider this an unfair advantage compared to our and other methods, because those implicitly benefit from ImageNet images as well when using pretrained classification networks for initialization.

Fig. 4. Examples of predicted segmentations (val set, successful cases).

Fig. 5. Examples of predicted segmentations (val set, failure cases).

Qualitative Results. Figure 4 illustrates typical successful segmentations. It shows that our method can produce accurate segmentations even for non-trivial images and recover fine details of the boundary. Figure 5 illustrates some failure cases. As is typical for weakly-supervised systems, SEC has problems segmenting objects that occur almost always in front of the same background, e.g. boats on water, or trains on tracks. We addressed this problem recently in follow-up work [15]. A second failure mode is that object regions can be segmented correctly, but assigned wrong class labels. This is actually quite rare for SEC, which we attribute to the fact that the DeepLab network has a large field-of-view and therefore can make use of the full image when assigning labels. Finally, it can also happen that segmentations cover only parts of objects. This is likely due to imperfections of the weak localization cues that tend to reliably detect only the most discriminative parts of an object, e.g. the face of a person. This might not be sufficient to segment the complete object, however, especially when objects overlap each other or consist of multiple components of very different appearance.

4.3 Detailed Discussion

To provide additional insight into the working mechanisms of the SEC loss function, we performed two further sets of experiments on the val data. First, we analyze different global pooling strategies, and second, we perform an ablation study that illustrates the effect of each of the three terms in the proposed loss function visually as well as numerically.

Effect of Global Pooling Strategies. As discussed before, the quality of segmentations depends on which global pooling strategy is used to convert segmentation masks into per-image classification scores. To quantify this effect, we train three segmentation networks from weak supervision, using either GMP, GAP or GWRP as the aggregation method for classes that are present in the image. For classes that are not present we always use GMP, i.e. we penalize any occurrence of these classes. In Fig. 6 we demonstrate visual results for every pooling strategy and report two quantities: the fraction of pixels that are predicted to belong to a foreground (fg) class, and the segmentation performance as measured by mean IoU. We observe that GWRP outperforms the other methods in terms of segmentation quality, and the fractions of predicted foreground pixels support our earlier hypothesis: the model trained with GMP tends to underestimate object sizes, while the model trained with GAP on average overestimates them. In contrast, the model trained with GWRP produces segmentations in which objects are, on average, close to the correct size.

Effect of the Different Loss Terms. To investigate the contribution of each term in our composite loss function, we train segmentation networks with loss functions in which different terms of the SEC loss are omitted. Figure 7 provides numerical results and illustrates typical segmentation mistakes that occur when certain loss terms are omitted. The best results are achieved when all three loss terms are present. However, the experiments also allow us to draw two interesting additional conclusions about the interaction between the loss terms.

Seeding Loss and Large Field-of-View. First, we observe that having \(L_\text {seed}\) in the loss function is crucial to achieve competitive performance. Without this loss term our segmentation network fails to reflect the localization of objects in its predictions, even though the network does match the global label statistics rather well. See the third column of Fig. 7 for an illustration of this effect.

We believe that this effect can be explained by the large (378\(\,\times \,\)378) field-of-view (FOV) of the segmentation network: if an object is present in an image, then the majority of the predicted scores may be influenced by this object, no matter where the object is located. This helps in predicting the right class labels, but can negatively affect the localization ability. Other researchers addressed this problem by explicitly changing the architecture of the network in order to reduce its field-of-view [23]. However, networks with a small field-of-view are less powerful and often fail to recognize which semantic labels are present in an image. We conducted an additional experiment (see the technical report [16] for details) that confirms that SEC with a small (211\(\,\times \,\)211) field-of-view network performs clearly worse than with the large (378\(\,\times \,\)378) field-of-view network; see Fig. 8 for numeric results and visual examples. Thus, we conclude that the seeding loss provides the necessary localization guidance that enables the large field-of-view network to still reliably localize objects.

Fig. 6. Results on the val set and examples of segmentation masks for models trained with different pooling strategies.

Effects of the Expansion and Constrain-to-Boundary Losses. By construction, the constrain-to-boundary loss encourages nearby regions of similar color to have the same label. However, this is often not enough to turn the weak localization cues into segmentation masks that cover a whole object, especially if the object consists of visually dissimilar parts, such as people wearing clothes of different colors. See the sixth column of Fig. 7 for an illustration of this effect.

The expansion loss, based on GWRP, suppresses the prediction of classes that are not meant to be in the image, and it encourages classes that are in the image to have reasonable sizes. When combined with just the seeding loss, however, the expansion loss actually results in a drop in performance. The fifth column of Fig. 7 shows an explanation of this: object sizes are generally increased, but the additionally predicted regions do not match the image boundaries.

In combination, the seeding loss provides reliable seed locations, the expansion loss acts as a force to enlarge the segmentation masks to a reasonable size, and the constrain-to-boundary loss constrains the segmentation masks to line up with image boundaries, thus integrating low-level image information. The result is substantially improved segmentation masks, as illustrated in the last column of Fig. 7.

Fig. 7. Results on the val set and examples of segmentation masks for models trained with different loss functions.

Fig. 8. Results on the val set and examples of segmentation masks for models with small or large fields-of-view.

5 Conclusion

We propose a new loss function for training deep segmentation networks when only image-level labels are available. We demonstrate that our approach outperforms previous state-of-the-art methods by a large margin when used under the same experimental conditions and provide a detailed ablation study.

We also identify potential directions that may help to further improve weakly-supervised segmentation performance. Our experiments show that knowledge about object sizes can dramatically improve the segmentation performance. SEC readily allows incorporating size information through decay parameters, but a procedure for estimating object sizes automatically would be desirable. A second way to improve the performance would be stronger segmentation priors, for example about shape or materials. This could offer a way to avoid mistakes that are currently typical for weakly-supervised segmentation networks, including ours, for example that boats are confused with the water in their background.