1 Introduction

Semantic segmentation is a fundamental visual task of assigning a class to every pixel of a given image [1,2,3]. With deep convolutional neural networks (CNNs), semantic segmentation has progressed noticeably in fully supervised settings [4,5,6]. This, however, requires massive pixel-level dense annotations, which are difficult to acquire due to the expensive and laborious data-labeling process. It is thus desirable to develop segmentation techniques that rely on weak supervision yet achieve performance on par with that achieved with strong supervision. Recent years have witnessed many research efforts in weakly supervised semantic segmentation (WSSS), which uses weak labels such as points [7], scribbles [8, 9], bounding boxes [10, 11] and image labels [12,13,14]. In particular, image-level labels are easy to acquire and annotate, but they also carry the least information, i.e., the classes present in an image without any information about object localization.

A critical step for WSSS is to use weak labels to produce pseudo segmentation labels. Given image labels, techniques for model interpretation such as CAM [15] and Grad-CAM [16] are able to extract object localization maps from the intermediate layers of CNNs. However, these CAM maps only indicate the most discriminative object regions, which are incomplete and do not provide sufficient semantic segmentation supervision. To obtain advanced segmentation pseudo labels, prior works have developed various strategies to discover non-discriminative object regions [17,18,19] and expand CAM maps. However, resulting CAM maps still exhibit inaccurate boundaries, leading to incorrect segmentation predictions. Besides, previous methods [20,21,22] commonly applied a pre-trained saliency detector on a target segmentation dataset, to extract useful object localization information to assist pseudo semantic label generation.

Fig. 1

Column a shows a supervised semantic segmentation training example including an image (top) and a pixel-wise class label mask (bottom) which is not available in the WSSS task. We propose to learn class-agnostic masks for WSSS. Column b shows a pre-trained saliency map (top) which is used to initially supervise the class-agnostic mask learning and to guide the generation of an initial pseudo semantic label mask (bottom). Column c shows the improved pseudo labels for both tasks using the proposed cross-task label refinement, which provides better supervision for the network training

However, for the task of semantic segmentation, off-the-shelf saliency maps can also introduce misleading object information due to the gap between the salient objects learned by the pre-trained model and the objects of interest. For example, the salient object detected by a pre-trained saliency model is the dog house (Fig. 1b top), while the target object for segmentation is the dog (Fig. 1a bottom). This can lead to inaccurate pseudo labels, and the network training is vulnerable to those errors.

These observations indicate that having precise object boundary information is important to both segmentation prediction and pseudo segmentation label generation. We thus propose to construct a class-agnostic mask learning task while exploiting supervision from off-the-shelf saliency maps. This is performed jointly with semantic segmentation using a multi-branch network. The benefits are two-fold: (i) the multi-task joint learning can regularize feature learning and learn more robust features for semantic segmentation, especially to differentiate between foreground objects and backgrounds; moreover, it can effectively improve the generalization ability of the network by leveraging the useful information from related tasks, resulting in a stronger inductive bias compared to single-task learning and thus reducing overfitting; (ii) the learned class-agnostic maps can also contribute to the pseudo segmentation label generation. In particular, off-the-shelf saliency maps are not only utilized to generate initial pseudo semantic labels (Fig. 1b bottom), but are also used to provide initial supervision for learning class-agnostic masks, by incorporating online class-agnostic mask predictions for self-refinement. Compared to off-the-shelf saliency maps, such learned class-agnostic masks are more adaptive to the target semantic segmentation task and contain more accurate object localization information. We further propose a cross-task label refinement mechanism to take advantage of the learned semantic segmentation masks and class-agnostic masks, thereby producing refined pseudo labels for both tasks (Fig. 1c). Moreover, we propose a new normalization method for CAM to generate class-specific localization maps, which can cover entire object regions. By combining the improved CAM maps and the proposed discriminative foreground-background class-agnostic masks, pseudo semantic labels can be substantially improved to better optimize the whole deep neural network.

Our contributions are summarized as follows:

  • We propose to jointly learn class-agnostic masks and semantic segmentation using image labels and off-the-shelf saliency maps. Such an approach is shown to lead to an improved segmentation performance, and this is also shown to provide more reliable class-agnostic masks for pseudo label generation. We leverage a new normalization method for CAM to produce class-specific localization maps (i.e., pCAM), which can cover entire object regions. The resulting pCAM maps complement the class-agnostic maps, producing high-quality pseudo semantic labels.

  • We introduce a cross-task label refinement mechanism, which jointly leverages predictions from the tasks of class-agnostic and semantic segmentation with pCAM maps to refine their pseudo labels. This mechanism is shown to effectively correct the errors brought by the pre-trained saliency model, providing more accurate supervision to learn semantic segmentation and class-agnostic masks.

  • The proposed method achieves superior WSSS results compared to state-of-the-art methods on PASCAL VOC 2012 and MS COCO (Sect. 4.2).

The rest of this paper is organized as follows. We review related work in Sect. 2 and describe the proposed approach in Sect. 3. Section 4 presents experimental details with ablation studies and discusses the results. Section 5 concludes the paper.

2 Related work

This section reviews recent image-label supervised semantic segmentation approaches, including CAM map-based refinement and semantic prediction-based refinement.

2.1 CAM map-based refinement

Most existing WSSS approaches have utilized the object localization information of the CAM maps to produce pseudo semantic labels. However, the raw CAM maps only indicate the most discriminative object regions, which are small and sparse. A typical example of improving CAM maps is heuristics-based object mining. By iteratively erasing the detected object regions of input images [23], the network is driven to learn new patterns for classification. Similar techniques based on heuristic erasing have been presented in [24, 25]. Jiang et al. [21] observe that the classification network attends to different object regions during training and obtain an integral object localization map by accumulating CAM maps online. Since relying solely on the conventional classification loss leads to incomplete CAM maps, prior works apply different regularization methods for training the classification network to obtain improved CAM maps. Wang et al. [26] suggest that imposing an equivariance constraint on the CAM maps under any spatial affine transformation can result in maps which better fit the shape of objects. Fan et al. [18] observe that the standard classification objective only focuses on the discrimination between different object classes, ignoring the boundaries between each class and the backgrounds. They thus propose to learn an intra-class boundary based on the implicit feature manifold. Chang et al. [27] re-formulate the problem into a fine-grained classification task, for which the pseudo labels of sub-categories are extracted from unsupervised feature clustering. Moreover, cross-image relations have also been explored to enhance the representations for extracting CAM maps, e.g., in [19] and [28]. A recent work by Zhang et al. [29] argues that it is the confounding context from the dataset that causes the ambiguous boundaries of CAM maps. Subsequently, they propose to use class-specific average segmentation masks to approximate the confounding set and incorporate it into the image classification to obtain better CAM maps. In contrast, we propose a different normalization method for CAM, which generates more complete CAM maps than the original ones.

2.2 Semantic prediction-based refinement

There are several other methods which focus on refining the pseudo segmentation labels, which are usually produced by the raw CAM maps, by taking advantage of the segmentation predictions. For instance, Wang et al. [30] refine semantic pseudo labels by discovering object affinities based on super-pixel regions derived from the segmentation prediction. Similarly, Wang et al. [31] iteratively select reliable regions from the segmentation outputs to learn pixel-wise affinities, which are then utilized to refine the segmentation results and produce pseudo segmentation labels. Araslanov et al. [32] propose to refine the segmentation results based on image local consistency so as to obtain pseudo semantic labels to enable the optimization of the segmentation network.

Although more non-discriminative object regions are discovered by these complex methods, their resulting pseudo segmentation labels generally have coarse boundaries. Therefore, a number of methods [19,20,21, 23, 24, 33, 34] have exploited background cues from off-the-shelf saliency maps to assist pseudo semantic label generation. However, these pre-trained saliency models are not generally adapted well to the semantic segmentation task. In this work, we address this problem by formulating a task of learning class-agnostic masks and incorporating it into a joint learning framework with semantic segmentation to obtain more generalizable representations. Moreover, in order to provide better supervision to the learning of class-agnostic maps, we combine the pre-trained saliency maps and the online predicted class-agnostic maps which can provide complementary and progressively more accurate class-agnostic localization information. We also propose a cross-task label refinement mechanism to further refine pseudo labels to learn both class-agnostic and semantic segmentation masks.

Fig. 2

An overview of the proposed method. We propose to improve WSSS by learning class-agnostic masks, which is formulated as a multi-task problem, i.e., semantic segmentation (SS), class-agnostic (CA) segmentation and multi-label image classification. Because only image labels are provided, to supervise the learning of class-agnostic and semantic segmentation masks, we use a pre-trained saliency map as (f) an initial pseudo class-agnostic mask and progressively refine it by incorporating the online class-agnostic prediction. Moreover, we combine (c) the proposed pCAM maps (illustrated in Fig. 3) extracted from the classification branch and a pre-trained saliency map to generate (e) an initial pseudo semantic segmentation mask using the conventional thresholding-based procedures. Once the initial network training converges, we introduce a cross-task label refinement module (illustrated in Fig. 4) to use (b) the predicted semantic segmentation mask and (d) the predicted class-agnostic mask to produce (g) a refined pseudo semantic segmentation mask and (h) a refined pseudo class-agnostic mask to re-train the network

3 The proposed method

This section starts with an overview, followed by the description of our weakly supervised multi-task network architecture. The subsequent subsection describes the proposed normalization method for CAM to produce improved class-specific localization maps, which constitute a key component of the pseudo semantic label generation process. The following subsection elaborates the details of the proposed cross-task pseudo label generation for class-agnostic and semantic segmentation tasks. The final subsection presents the model training and inference processes.

3.1 Overview

Figure 2 presents an overview of the proposed approach. We build a multi-branch network to jointly perform semantic segmentation, class-agnostic segmentation and image classification tasks, with only image-level annotations. Besides, we utilize a general pre-trained saliency model to generate binary maps as a guide to provide supervision for the learning of the other two tasks. More specifically, we propose a different normalization method for CAM maps, generating more complete class-specific object localization maps. The improved CAM maps are combined with the pre-trained saliency maps to produce better initial pseudo semantic segmentation labels. For the class-agnostic learning, the pre-trained saliency maps are initially used as pseudo labels and are gradually refined by combining the online class-agnostic predictions. Once the training is complete, we propose a cross-task label refinement mechanism, which jointly takes advantage of the class-agnostic and semantic segmentation predictions to produce improved pseudo class-agnostic and semantic segmentation labels. The refined pseudo labels are then leveraged to fine-tune the multi-task network, leading to improved semantic segmentation results.

3.2 Weakly supervised multi-task network architecture

We build our deep network based on ResNet38 [35], which has 38 convolutional layers with wide channels. Following [36], we make modifications to the original ResNet38 to construct a backbone network with an output stride of 8. In order to learn more robust and informative representations for weakly supervised semantic segmentation, we adopt three branches following the backbone network, i.e., an image classifier, a class-agnostic segmentation decoder, and a semantic segmentation decoder. More specifically, given an RGB image as input, the backbone network produces an activation map \({\textbf{F}} \in {\mathbb {R}}^{H\times W\times K}\), with H and W indicating its two spatial dimensions and K its number of channels. For the classification task, a Global Average Pooling (GAP) layer is applied to the backbone feature maps. The resulting feature vector is forwarded to a fully connected (fc) layer, which predicts the class probabilities. For the class-agnostic segmentation branch, the backbone feature maps are forwarded to a DenseASPP module [37], which is composed of three cascaded atrous convolutional (aconv) layers (rates = 6, 12, and 18). Finally, a \(1\times 1\) convolutional (conv) layer, followed by a sigmoid layer, is applied to predict the class-agnostic masks. Similarly, the semantic segmentation decoder includes three aconv layers (rates = 6, 12, and 18) and a final \(1\times 1\) conv layer, followed by a softmax layer, for semantic segmentation prediction.

3.3 Generating class probability-based CAM maps

We use CAM [15] to produce class-specific localization maps for the generation of semantic segmentation pseudo labels. More specifically, for a given class c and spatial coordinates (i, j), the CAM map is calculated as follows:

$$\begin{aligned} \textbf{CAM}_c(i, j) = \sum _{k}^{K}{\textbf{W}}_{k}^{c}{\textbf{F}}_{k}(i,j), \end{aligned}$$
(1)

where \({\textbf{W}} \in {\mathbb {R}}^{K\times C}\) is the weight matrix of the last fc layer, with C denoting the number of classes, and \({\textbf{W}}_{k}^{c}\) represents the importance score of the channel k to the class c. As shown in Fig. 3, the generated CAM map for the class c is processed via the min-max normalization along the spatial dimensions, referred to as \(\textrm{sCAM}\):

$$\begin{aligned} \textbf{sCAM}_{c}(i,j)=\frac{\textrm{ReLU}(\textbf{CAM}_c(i,j))}{\max _{(i,j)}\textbf{CAM}_{c}}. \end{aligned}$$
(2)
Fig. 3

Illustration of the normalization methods to generate sCAM and the proposed pCAM maps: sCAM maps [15] are generated by applying the min-max normalization along the spatial dimensions (i.e., \(H~\times ~W\)); our proposed pCAM maps are generated by applying softmax along the channel dimension (i.e., C)

In contrast to sCAM, we propose to use a different normalization method to generate CAM maps based on class probabilities, hereinafter referred to as pCAM. More specifically, as illustrated in Fig. 3, pCAM maps are produced by applying the softmax operation along the channel dimension. As a result, each spatial vector of the resulting pCAM map represents the class probability distribution of the corresponding pixel:

$$\begin{aligned} \textbf{pCAM}_{c}(i,j) = \frac{\exp {(\textbf{CAM}_{c}(i,j))}}{\sum \nolimits _{c'=1}^{C}\exp {(\textbf{CAM}_{c'}(i,j))}}. \end{aligned}$$
(3)

sCAM tends to highlight the most discriminative regions among all spatial locations. In contrast, the proposed pCAM focuses on highlighting the pixels which have large probabilities for the given class. For the classes present in a given image, their corresponding CAM maps tend to have higher activation values, compared to those CAM maps of classes absent in the image. Therefore, the class activated regions by pCAM are larger than those given by sCAM.
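
The two normalizations can be compared with a minimal NumPy sketch of Eqs. 1 to 3. The function names, the random inputs and the small `eps` guard are illustrative assumptions and are not taken from the original implementation.

```python
import numpy as np

def cam_maps(features, fc_weights):
    """Eq. 1: CAM_c(i, j) = sum_k W[k, c] * F[i, j, k]; returns shape (H, W, C)."""
    return np.einsum("hwk,kc->hwc", features, fc_weights)

def scam(cam, eps=1e-8):
    """Eq. 2: ReLU, then division by the per-class spatial maximum."""
    positive = np.maximum(cam, 0.0)
    peak = cam.max(axis=(0, 1), keepdims=True)
    # eps (an assumption) avoids division by a non-positive peak
    return positive / np.maximum(peak, eps)

def pcam(cam):
    """Eq. 3: softmax over the class axis at every pixel."""
    shifted = cam - cam.max(axis=-1, keepdims=True)  # numerical stability only
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 5, 8))  # toy backbone features: H=4, W=5, K=8
W = rng.standard_normal((8, 3))     # toy fc weights: K=8 channels, C=3 classes
cam = cam_maps(F, W)
s, p = scam(cam), pcam(cam)
```

Each pixel of `p` is a probability distribution over the C classes, whereas each class channel of `s` is scaled independently by its own spatial peak.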

3.4 Cross-task pseudo label generation

This section presents the proposed two-step method for generating class-agnostic and semantic segmentation pseudo labels.

3.4.1 Initial pseudo label generation

To learn class-agnostic masks, given no ground truth, we propose to utilize a coarse saliency label map \({{\textbf{P}}}{{\textbf{t}}}_{sal}\) estimated by a pre-trained saliency model as initial guide and incorporate complementary information from online predictions. More specifically, the pre-trained saliency model generally yields reasonable results on the source images, and it is however error-prone when applied on complex images with low contrast or complex backgrounds due to its limited generalization ability on different target datasets. Moreover, the detected salient object may not be the object of interest for the target task. In contrast, with the shared backbone features, the predicted class-agnostic mask, denoted as \({{\textbf{P}}}{{\textbf{r}}}_\text {ca}\), contains useful object localization information, which becomes more reliable with the learning. Therefore, we propose to generate pseudo class-agnostic label masks \({\textbf{G}}^{init}_\text {ca}\) by fusing these two complementary sources through a Conditional Random Field (CRF) model:

$$\begin{aligned} \mathbf {{\textbf{G}}}^{init}_\text {ca} = \textrm{CRF}_d\left(\frac{\mathbf {{{\textbf{P}}}{{\textbf{r}}}}_\text {ca} + \mathbf {{{\textbf{P}}}{{\textbf{t}}}}_{sal}}{2}\right), \end{aligned}$$
(4)

where \(\textrm{CRF}_d(\cdot )\) denotes a densely connected CRF [38] which uses the average of \({{\textbf{P}}}{{\textbf{r}}}_\text {ca}\) and \({{\textbf{P}}}{{\textbf{t}}}_{sal}\) as a unary term. The fused output from the CRF model is more adapted to the target dataset, thereby providing better supervision to learn class-agnostic masks. To generate initial pseudo segmentation labels \({\textbf{G}}^{init}_\text {seg}\), we follow previous works [19,20,21, 23, 24, 33] and use thresholding-based rules to combine pCAM maps and off-the-shelf saliency maps. More specifically, we determine the potential object regions from pCAM maps and off-the-shelf saliency maps using pre-defined thresholds, to which the final segmentation performance is not sensitive. The overlapping object regions in these two maps are assigned the object labels with the largest class probabilities. Conflicting object regions, such as pixels belonging to the foreground objects of the saliency map but regarded as background by the pCAM map and vice versa, are not used to compute the loss. The remaining pixels are labeled as background.
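
The thresholding rules above can be sketched as follows. This is one plausible reading rather than the original procedure: the ignore index 255, the masking of pCAM by the image labels, the function name and the label encoding (0 for background, c + 1 for object class c) are all assumptions. The default thresholds of 0.5 and 0.8 are the values reported in Sect. 4.1.3.

```python
import numpy as np

IGNORE = 255  # hypothetical ignore index; such pixels are excluded from the loss

def initial_pseudo_labels(pcam, saliency, image_labels, t_sal=0.5, t_cam=0.8):
    """pcam: (H, W, C) class probabilities; saliency: (H, W) in [0, 1];
    image_labels: (C,) multi-hot. Returns an (H, W) integer label map."""
    masked = pcam * image_labels           # assumption: drop classes absent from the image
    best_class = masked.argmax(axis=-1)
    cam_fg = masked.max(axis=-1) > t_cam   # object region according to pCAM
    sal_fg = saliency > t_sal              # object region according to saliency

    labels = np.zeros(saliency.shape, dtype=np.int64)  # remaining pixels: background
    both = cam_fg & sal_fg
    labels[both] = best_class[both] + 1    # overlap: class with the largest probability
    labels[cam_fg ^ sal_fg] = IGNORE       # conflict: not used to compute the loss
    return labels

pc = np.array([[[0.9, 0.1], [0.9, 0.1]],
               [[0.5, 0.5], [0.5, 0.5]]])
sal = np.array([[0.9, 0.1],
                [0.9, 0.1]])
labels = initial_pseudo_labels(pc, sal, np.array([1, 1]))
# labels == [[1, 255], [255, 0]]
```

In the toy example, only the top-left pixel is foreground in both maps, the two pixels on which the maps disagree are ignored, and the remaining pixel is background.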

Fig. 4

An illustration of the proposed cross-task label refinement for the pseudo label update of the class-agnostic (CA) and the semantic segmentation (SS) tasks

3.4.2 Cross-task label refinement

When the joint multi-task optimization converges, the improved predictions from all three tasks can be utilized for cross-task refinement. This yields improved pseudo labels for class-agnostic and semantic segmentation, which can further boost multi-task learning. Figure 4 depicts the computation flow of the proposed cross-task refinement module. Given the predicted class-agnostic mask \({{\textbf{P}}}{{\textbf{r}}}_\text {ca}\) and the predicted semantic map \({{\textbf{P}}}{{\textbf{r}}}_\text {seg}\), we perform a structured fusion of these two types of predictions to obtain the refined class-agnostic pseudo label mask \({\textbf{G}}^{ref}_\text {ca}\) as follows:

$$\begin{aligned} {\textbf{G}}^{ref}_\text {ca} = \textrm{CRF}_d\left(\frac{{{\textbf{P}}}{{\textbf{r}}}_\text {ca} + \textrm{Br}_s({{\textbf{P}}}{{\textbf{r}}}_\text {seg})}{2}\right), \end{aligned}$$
(5)

where \(\textrm{Br}_s(\cdot )\) is a binarization operation on the segmentation probability map, outputting a one-channel map \({{\textbf{P}}}{{\textbf{r}}}'_\text {seg}\) with values of 0 and 1; the model \(\textrm{CRF}_d\) shares the same parameters as that used in Eq. 4. More specifically, \(\textrm{Br}_s\) first converts the segmentation map \({{\textbf{P}}}{{\textbf{r}}}_\text {seg}\) into a one-channel map and then binarizes it, with label 1 representing ‘foreground’ and label 0 representing ‘background’, as follows:

$$\begin{aligned} {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}&= \mathop {\mathrm {arg\,max}}\limits _{c}\textrm{Supp}({{\textbf{P}}}{{\textbf{r}}}_\text {seg}), \end{aligned}$$
(6)
$$\begin{aligned} {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}(i,j)&= {\left\{ \begin{array}{ll} 1&{} \text { if } {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}(i,j) > 0, \\ 0&{} \text { if } {{\textbf{P}}}{{\textbf{r}}}'_\text {seg}(i,j) = 0, \end{array}\right. } \end{aligned}$$
(7)

where \(\textrm{Supp}(\cdot )\) denotes a suppression function, which multiplies \({{\textbf{P}}}{{\textbf{r}}}_\text {seg}\) by the image-level labels across the class channel to suppress incorrect predictions. Then, similar to the procedures of initial pseudo semantic label generation, we combine the pCAM map and the refined class-agnostic pseudo label mask \({\textbf{G}}^{ref}_\text {ca}\) to obtain refined pseudo semantic label \({\textbf{G}}^{ref}_\text {seg}\). Finally, the refined pseudo class-agnostic label masks \({\textbf{G}}^{ref}_\text {ca}\) and the refined pseudo semantic label masks \({\textbf{G}}^{ref}_\text {seg}\) are used together with the image labels to re-train the overall network.
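
A minimal NumPy sketch of the binarization \(\textrm{Br}_s\) (Eqs. 6 and 7) and of the averaged input passed to \(\textrm{CRF}_d\) in Eq. 5 is given below. It assumes that channel 0 of the segmentation output is the background and that the background is never suppressed; both points, as well as the function names, are assumptions. The dense CRF step itself is omitted.

```python
import numpy as np

def binarize_segmentation(pr_seg, image_labels):
    """pr_seg: (H, W, C + 1) softmax output with channel 0 = background (assumed).
    image_labels: (C,) multi-hot labels of the object classes."""
    keep = np.concatenate(([1.0], image_labels))  # assumption: background always kept
    suppressed = pr_seg * keep                    # Supp(.): zero out absent classes
    label_map = suppressed.argmax(axis=-1)        # Eq. 6: one-channel label map
    return (label_map > 0).astype(np.float64)     # Eq. 7: 1 = foreground, 0 = background

def crf_unary_input(pr_ca, pr_seg, image_labels):
    """Average of the two predictions, i.e., the argument of CRF_d in Eq. 5."""
    return 0.5 * (pr_ca + binarize_segmentation(pr_seg, image_labels))

pr_seg = np.array([[[0.2, 0.7, 0.1], [0.3, 0.1, 0.6]]])  # one row, two pixels
pr_ca = np.array([[0.8, 0.4]])
fg = binarize_segmentation(pr_seg, np.array([1.0, 0.0]))
unary = crf_unary_input(pr_ca, pr_seg, np.array([1.0, 0.0]))
```

In the toy input, the second pixel is predicted as object class 2, which is absent from the image labels, so suppression turns it into background before binarization.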

3.5 Training and inference

3.5.1 Training

Our overall learning objective function is formulated as follows:

$$\begin{aligned} {\mathcal {L}}_\text {all}= {} {\mathcal {L}}_\text {cls} + {\mathcal {L}}_\text {ca} + {\mathcal {L}}_\text {seg}, \end{aligned}$$
(8)
$$\begin{aligned} {\mathcal {L}}_\text {cls}= -\sum \limits ^{N}_{i=1}\left[ {\textbf{G}}_\text {cls}^{i}\log \frac{\exp ({{\textbf{P}}}{{\textbf{r}}}_\text {cls}^{i})}{1+\exp ({{\textbf{P}}}{{\textbf{r}}}_\text {cls}^{i})} + (1-{\textbf{G}}_\text {cls}^{i})\log \frac{1}{1+\exp ({{\textbf{P}}}{{\textbf{r}}}_\text {cls}^{i})}\right] , \end{aligned}$$
(9)
$$\begin{aligned} {\mathcal {L}}_\text {ca}= -\sum \limits ^{M}_{j=1}\big [{\textbf{G}}_\text {ca}^{j}\log {{\textbf{P}}}{{\textbf{r}}}_\text {ca}^{j}+ (1-{\textbf{G}}_\text {ca}^{j})\log (1-{{\textbf{P}}}{{\textbf{r}}}_\text {ca}^{j})\big ], \end{aligned}$$
(10)
$$\begin{aligned} {\mathcal {L}}_\text {seg}= -\sum \limits ^{M}_{j=1}\sum \limits ^{N}_{i=1} {\textbf{G}}_\text {seg}^{i,j}\log {{\textbf{P}}}{{\textbf{r}}}_\text {seg}^{i,j}, \end{aligned}$$
(11)

where \({\mathcal {L}}_\text {cls}\) is a multi-label soft margin loss calculated between the class predictions \({{\textbf{P}}}{{\textbf{r}}}_\text {cls}\) and the multi-hot image labels \({\textbf{G}}_\text {cls}\); \({\mathcal {L}}_\text {ca}\) is a binary cross-entropy loss computed between the predicted class-agnostic masks \({{\textbf{P}}}{{\textbf{r}}}_\text {ca}\) and the class-agnostic pseudo label masks \({\textbf{G}}_\text {ca}\); and \({\mathcal {L}}_\text {seg}\) is a pixel-wise cross-entropy loss computed between the semantic segmentation predictions \({{\textbf{P}}}{{\textbf{r}}}_\text {seg}\) and the pseudo semantic labels \({\textbf{G}}_\text {seg}\). N and M denote the numbers of classes of a dataset and pixels of an input image, respectively.
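
For reference, Eqs. 8 to 11 can be written out as a short NumPy sketch. The function names, the flattened input shapes and the `eps` clipping are illustrative assumptions. The sums follow the equations as printed, whereas a framework implementation may average over classes or pixels instead.

```python
import numpy as np

def loss_cls(logits, g_cls):
    """Eq. 9: multi-label soft margin loss; logits and g_cls have shape (N,)."""
    prob = 1.0 / (1.0 + np.exp(-logits))  # equals exp(x) / (1 + exp(x))
    return -np.sum(g_cls * np.log(prob) + (1 - g_cls) * np.log(1 - prob))

def loss_ca(pr_ca, g_ca, eps=1e-7):
    """Eq. 10: binary cross-entropy over M pixels; inputs have shape (M,)."""
    p = np.clip(pr_ca, eps, 1 - eps)      # clipping (an assumption) avoids log(0)
    return -np.sum(g_ca * np.log(p) + (1 - g_ca) * np.log(1 - p))

def loss_seg(pr_seg, g_seg, eps=1e-7):
    """Eq. 11: pixel-wise cross-entropy; pr_seg and one-hot g_seg have shape (M, N)."""
    return -np.sum(g_seg * np.log(np.clip(pr_seg, eps, 1.0)))

def loss_all(logits, g_cls, pr_ca, g_ca, pr_seg, g_seg):
    """Eq. 8: unweighted sum of the three task losses."""
    return loss_cls(logits, g_cls) + loss_ca(pr_ca, g_ca) + loss_seg(pr_seg, g_seg)

total = loss_all(np.zeros(2), np.array([1.0, 0.0]),
                 np.array([0.5, 0.5]), np.array([1.0, 0.0]),
                 np.array([[0.5, 0.5]]), np.array([[1.0, 0.0]]))
```

With this encoding, a pixel excluded from the loss can be represented by an all-zero row in `g_seg`, which contributes nothing to the sum.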

Fig. 5

The learning pipeline of the proposed approach. Our training process has four steps. a The classification branch is first optimized to produce pCAM maps, which are then utilized to produce initial pseudo semantic label masks with the guide of pre-trained saliency maps. b Then, the entire multi-branch network is trained to predict semantic segmentation and class-agnostic masks using the initial pseudo semantic labels and the pre-trained saliency maps. c The network predictions are jointly used in the proposed cross-task label refinement, producing refined pseudo semantic and class-agnostic label masks. d Finally, we re-train the multi-task network using the updated pseudo labels produced by (c)

Figure 5 illustrates the proposed pipeline. More specifically, the classification branch of the multi-task network is first trained for 15 epochs, with the other two branches frozen, to extract pCAM maps. The initial pseudo semantic label masks are then produced by fusing pCAM maps and off-the-shelf saliency maps. With the initial pseudo labels, the entire network is then trained for 15 epochs. Afterward, we perform the cross-task label refinement using the learned class-agnostic masks and semantic segmentation masks and obtain refined pseudo labels for the two tasks. Subsequently, the overall multi-task model is re-trained for 15 epochs with these refined pseudo labels.

4 Experiments

4.1 Experimental settings

4.1.1 Datasets

To evaluate the proposed method, we conducted experiments on PASCAL VOC 2012 [39] and MS COCO datasets [40]. PASCAL VOC has 20 object classes and one background class for semantic segmentation. This dataset is split into training (train), validation (val) and test sets with 1464, 1449 and 1456 images, respectively. Following common practice, e.g., [41, 42], additional images from [43] are used to augment the training set, resulting in a total of 10,582 training images. MS COCO has 80 object classes, one background class, 80K images for training and 40K images for validation.

4.1.2 Evaluation metrics

We followed the prior works [17, 20, 21, 30, 33, 36, 44] to use the mean Intersection-over-Union (mIoU) of all classes between the predicted semantic segmentation masks and the pixel-wise ground-truth label masks to evaluate the segmentation performance of the proposed method. Moreover, mIoU and F1-score were used to evaluate the quality of the pseudo segmentation labels. The results on the PASCAL VOC test set were obtained from the official PASCAL VOC online evaluation server.
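
As a reminder of how mIoU is computed, the following NumPy sketch builds a confusion matrix and averages the per-class IoU. The ignore index of 255 and the skipping of classes absent from both prediction and ground truth are assumptions of this sketch. Benchmark evaluations typically accumulate the confusion matrix over the whole dataset before averaging, whereas this example handles a single pair of label maps.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """pred, gt: integer label maps of identical shape."""
    valid = gt != ignore_index
    # Encode each (gt, pred) pair as one index of a flattened confusion matrix.
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    conf = np.bincount(idx, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)     # rows: gt, columns: pred
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0                               # skip classes that never occur
    return float(np.mean(inter[present] / union[present]))

pred = np.array([[0, 1], [1, 1]])
gt = np.array([[0, 1], [0, 1]])
miou = mean_iou(pred, gt, num_classes=2)
```

In this toy case, the IoU is 1/2 for class 0 and 2/3 for class 1, giving an mIoU of 7/12.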

Table 1 Segmentation results of WSSS methods in mIoU (%) on the PASCAL VOC val and test sets
Fig. 6

Qualitative segmentation results on the PASCAL VOC val set. a Inputs. b Ground truth. c Our results

4.1.3 Implementation details

We used PyTorch [51] to implement all experiments. To train the proposed network, we used data augmentation techniques including random horizontal flipping, random scaling with a factor of \(\pm\, 0.3\), random cropping to size \(321\times 321\) and color jittering. We used the stochastic gradient descent (SGD) optimizer with a mini-batch size of 4 and an initial learning rate of 0.001, decayed with a polynomial schedule with a power of 0.9. The off-the-shelf saliency maps were generated by the pre-trained DSS model [38] (widely adopted in prior work [18, 19, 21, 24]), unless specified otherwise. For the pseudo semantic label generation, the thresholds to determine the potential object regions from class-agnostic masks and pCAM maps were empirically set as 0.5 and 0.8, respectively. For testing, we used CRFs with the hyper-parameters suggested in [41] to postprocess the segmentation predictions.
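
The learning-rate decay can be illustrated with the commonly used "poly" formula, assuming this is the schedule meant above; the exact formula, the per-step update and the function name are assumptions of this sketch.

```python
def poly_lr(step, max_steps, base_lr=0.001, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power

# Learning rate at the start, midpoint and end of a hypothetical 100-step run.
schedule = [poly_lr(s, 100) for s in (0, 50, 100)]
```

The rate starts at the initial value of 0.001 and decays monotonically to zero at the final step.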

Table 2 Per-class segmentation results of state-of-the-art WSSS methods in IoU (%) on PASCAL VOC

4.2 Comparisons with state-of-the-art methods

4.2.1 PASCAL VOC

Table 1 compares the segmentation results of the proposed approach with those of state-of-the-art WSSS approaches on PASCAL VOC. The proposed approach achieved mIoUs of 69.7% and 69.9% on the val and test sets, respectively, outperforming other methods. In particular, even without cross-task label refinement, the proposed network obtained superior results compared to most recent methods. Detailed per-class segmentation IoU results are shown in Table 2. Figure 6 visualizes our predicted segmentation masks on the PASCAL VOC val set, showing accurate boundaries with fine-grained details.

Table 3 Segmentation results of WSSS methods in mIoU (%) on the MS COCO val set
Fig. 7

Qualitative segmentation results on the MS COCO val set. a Inputs. b Ground truth. c Our results

4.2.2 MS COCO

We also compared our results with recent WSSS methods on the MS COCO val set in Table 3 and provided the detailed per-class IoU results in Table 4. Our method achieves an mIoU of 33.3%. Several qualitative segmentation results in Fig. 7 show that our approach can segment objects well at different scales in various indoor and outdoor scenes.

4.3 Ablation analysis

4.3.1 Effect of jointly learning multiple tasks

In Table 5, we compared the performance of jointly learning multiple weakly supervised tasks with the baseline method which only performs semantic segmentation. Note that we used the same initial pseudo semantic segmentation labels (see Fig. 5a) to train the different variants of the network. We can observe that jointly learning either image classification or class-agnostic masks with semantic segmentation under weak supervision significantly improves the segmentation results. In particular, learning class-agnostic masks attains a larger performance boost of 2%. Furthermore, jointly learning all three tasks attains the best mIoU of 65% without postprocessing. This indicates that jointly learning multiple weakly supervised tasks can boost the feature learning of semantic segmentation to achieve more accurate predictions.

Table 4 Per-class segmentation results of state-of-the-art WSSS methods in IoU (%) on the COCO validation set
Table 5 Segmentation performance using different architecture configurations on PASCAL VOC 2012 val set in mIoU (%)
Table 6 Comparison between sCAM and pCAM in terms of their resulting semantic segmentation (SS) pseudo labels and semantic segmentation performance on PASCAL VOC
Fig. 8

Visualization of sCAM maps and the proposed pCAM maps. a Input. b sCAM maps [15]. c The proposed pCAM maps

Table 7 Evaluation of semantic segmentation pseudo labels before and after applying the proposed cross-task label refinement (CTLR) with different pre-trained saliency models on the PASCAL VOC train set
Fig. 9

Segmentation results of different configurations on the PASCAL VOC val set using DSS [38] and DHS [52] pre-trained saliency models, respectively. Baseline, JL and CTLR denote single-task semantic segmentation, joint learning of multiple tasks and cross-task label refinement, respectively

4.3.2 Comparison of different CAM maps

We compared two types of CAM maps, i.e., pCAM and sCAM, which use different normalization methods. As shown in Fig. 8, the sCAM maps only focus on small and local discriminative regions. In contrast, the proposed pCAM maps cover entire object regions. We also compared, in terms of mIoU and F1-score, the pseudo semantic labels generated by pCAM with those generated by sCAM when incorporating the same off-the-shelf saliency maps. As shown in Table 6, the semantic segmentation pseudo labels generated by pCAM are significantly better than those generated by sCAM in both mIoU and F1-score. Accordingly, pCAM achieves superior segmentation results, outperforming sCAM by a large margin.

4.3.3 Effect of cross-task label refinement

We evaluated the quality of pseudo semantic labels using mIoU and F1-score, which account for both precision and recall and are thus indicative of the accuracy and completeness of the labeling, as well as the resulting segmentation performance. Table 7 shows that after applying the proposed cross-task label refinement, both the mIoU and F1-score of the pseudo segmentation labels increase significantly. Figure 9 reports similar increasing trends in the segmentation performance using DSS [38] and DHS [52] pre-trained saliency models, respectively. As visualized in Fig. 10, compared to the off-the-shelf saliency maps from the pre-trained DSS model, the refined class-agnostic masks exhibit more accurate object boundary information in various challenging scenarios, such as images with multiple object instances or objects with low contrast or complex backgrounds. As shown in the last two rows, more pixels are correctly labeled in the updated semantic segmentation pseudo labels, which are closer to the ground truth. This indicates that the proposed cross-task label refinement can provide better pseudo semantic labels.

Fig. 10

Visualization of pseudo labels for class-agnostic and semantic segmentation on the PASCAL VOC train set. a Inputs; b Semantic segmentation ground-truth labels; c Initial pseudo class-agnostic labels using DSS model [38]; d Refined pseudo class-agnostic labels after applying the proposed cross-task label refinement (CTLR); e Initial pseudo semantic labels; f Refined pseudo semantic labels after applying the proposed CTLR

4.3.4 Effect of postprocessing

We evaluated the effects of two postprocessing methods, which are commonly used in fully supervised semantic segmentation, i.e., (i) testing with inputs of multiple scales (e.g., 0.5, 0.75, 1.0, 1.25, 1.5 are used in this experiment) and (ii) using CRF. As shown in Table 8, without postprocessing, the proposed method produces an mIoU of 66.1%. Fusing the results of multi-scale inputs by max-pooling boosts mIoU to 67.2%. Only using CRF brings an improvement of 2.1%, compared to not using any postprocessing method. With both multi-scale testing and CRF, the proposed model yields the best segmentation result of 69.7%.
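
The max-pooling fusion of multi-scale predictions can be sketched as follows, assuming that the per-scale probability maps have already been resized to a common resolution (the resizing step and the function name are not taken from the original implementation).

```python
import numpy as np

def fuse_multiscale(prob_maps):
    """prob_maps: list of (H, W, C) arrays, one per input scale, at a common size."""
    fused = np.max(np.stack(prob_maps, axis=0), axis=0)  # element-wise max over scales
    return fused, fused.argmax(axis=-1)                  # fused scores and label map

scale_a = np.array([[[0.6, 0.4]]])  # toy prediction at one scale (1 x 1 pixel, 2 classes)
scale_b = np.array([[[0.2, 0.8]]])  # toy prediction at another scale
fused, fused_labels = fuse_multiscale([scale_a, scale_b])
```

In the toy example, the fused scores are [0.6, 0.8], so the pixel is assigned class 1, the class with the highest score at any scale.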

Table 8 Segmentation performance of the proposed approach with different postprocessing methods on the PASCAL VOC val set

5 Conclusion

In this work, we propose to improve WSSS by learning and refining class-agnostic masks. This brings two significant benefits, i.e., the enhanced feature representation for semantic segmentation and the improved object localization information for pseudo semantic label generation. For the latter, we propose a new normalization method to generate improved CAM maps. In addition, we propose a cross-task label refinement mechanism to jointly use class-agnostic and semantic segmentation predictions to generate further refined pseudo labels for both tasks. We have conducted extensive experiments on PASCAL VOC and MS COCO, achieving state-of-the-art WSSS results. The limitation of the proposed approach lies in that it relies on a multi-step training procedure. A future potential improvement of the proposed approach would be to develop an end-to-end framework which integrates the cross-task pseudo label updating process into the training of the weakly supervised multi-task network to achieve online cross-task label refinement.