Enhancing Semi Supervised Semantic Segmentation Through Cycle-Consistent Label Propagation in Video

Addanki, Veerababu; Yerramreddy, Dhanvanth Reddy; Durgapu, Sathvik; Boddu, Sasi Sai Nadh; Durgapu, Vyshnav

doi:10.1007/s11063-024-11459-6


Open access
Published: 06 February 2024

Volume 56, article number 4, (2024)

Neural Processing Letters

Veerababu Addanki¹,
Dhanvanth Reddy Yerramreddy¹,
Sathvik Durgapu¹,
Sasi Sai Nadh Boddu¹ &
…
Vyshnav Durgapu²


Abstract

Semantic image segmentation with deep learning models requires a large quantity of data and meticulous manual annotation (Mani in: Research anthology on improving medical imaging techniques for analysis and intervention. IGI Global, pp. 107–125, 2023), a process that consumes considerable time and money. To address this, we introduce a novel label propagation method (Qin et al. in IEEE/CAA J Autom Sinica 10(5):1192–1208, 2023) that utilizes cycle consistency across time to propagate labels over longer time horizons with higher accuracy. Additionally, we acknowledge that dense pixel annotation is a noisy process (Das et al. in: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 5978–5987, 2023), whether performed manually or automatically. To address this, we present a principled approach that accounts for label uncertainty when training with labels from multiple noisy labeling processes. We introduce two new approaches, Warp-Refine Propagation and Uncertainty-Aware Training, for improving label propagation and handling noisy labels, respectively, and support them with quantitative and qualitative evaluations and theoretical justification. Our contributions are validated on the Cityscapes and ApolloScape datasets, where we achieve encouraging results. In future work, the noisy label learning approach could be extended to other noisy augmentation processes, such as image-based rendering methods (Laraqui et al. in Int J Comput Aid Eng Technol 18(5):141–151, 2023).


1 Introduction

While machine learning has enabled high-performing models for various computer vision tasks [5, 6], large quantities of annotated data are still necessary to achieve state-of-the-art results [7]. This problem is exacerbated for tasks such as semantic image segmentation, where the extensive data annotation required for training can be prohibitively expensive. Various semi-supervised [8,9,10] and self-supervised methods [11, 12] have been proposed to alleviate this data bottleneck; however, they bring their own challenges, such as domain shift.

While dense annotation is costly [13, 14], collecting raw sequential image data is relatively cheap. This is especially true for autonomous driving datasets, such as [14,15,16]. Such datasets are often sparsely annotated across time (for example, in [14], pixel-level segmentation is provided for one frame in every continuous sequence of 30 frames). To alleviate the data bottleneck, we can generate approximate labels for the unlabelled samples. While weak labels can be generated via lazy labelling [17] or pseudo-semantic annotations [18, 19], the presence of an annotated sample in a continuous temporal sequence motivates the use of label propagation [20, 21] for generating approximate labels for neighboring frames.

The two primary concerns regarding label propagation are:

a) How to perform it? and

b) How to utilize the generated labels?

Our work addresses both, devising a novel technique to improve label propagation, and a principled approach for training with generated noisy labels.

Previous methods [22,23,24], due to their reliance on geometric cues, share several drawbacks with optical flow estimation [25]. Specifically, common properties of real-world data, such as a lack of brightness consistency or large motions, introduce harmful noise in the propagated labels. This drawback is aggravated by the accumulation of errors as labels are propagated further in time. This matters because longer propagation can yield more diverse labelled data, which is more beneficial for training deep neural networks [24].

Specifically, we make two important observations w.r.t previous propagation methods [22, 24]:

1) They do not leverage the semantic knowledge in the annotated dataset, and

2) The noise in labels propagated with geometric methods has a systemic component which can be modelled.

Our strategy, called Warp-Refine Propagation, addresses both of these concerns. The method consists of two steps: a label warping step followed by a label refinement step. At the heart of our approach is the concept of cycle consistency of labels, which we explain in detail in Sect. 3.2. Fig. 1 shows that our method fares better than others.

Despite these improvements, the propagated labels can still be noisy, especially when propagated over larger time steps [26]. We therefore formulate Uncertainty-Aware Training, a principled approach for training with noisy labels, in contrast to previous heuristic methods such as label relaxation [22, 27] or loss weighting [24]. During training, the model learns to maximize the likelihood of the noisy labels, expressed in terms of the underlying distribution of true labels. This allows us to estimate the uncertainty of each label generation process, mitigating the drawbacks of training with noisy labels. Furthermore, our approach can handle multiple noisy distributions at the same time. Hence, with minimal changes to our model, we are also able to use pseudo-labelling [18, 19], i.e. noisy predictions from a pretrained model, for data points where propagation is not possible. In Sect. 3.3, we draw the link between our approach and the modelling of aleatoric uncertainty [28] for different data generation processes.

Using our propagation method and our noisy label learning approach, we achieve state-of-the-art results on two large-scale datasets, Cityscapes and ApolloScape. Further, we provide a quantitative evaluation of various propagation methods, which has been missing from previous literature.

In summary, we

Improve label propagation with a new approach called Warp-Refine Propagation, and provide quantitative as well as qualitative evaluation of it.
Propose Uncertainty-Aware Training, a principled approach to training with noisy labels, for which we further provide theoretical justification.
Use the proposed advancements to achieve state-of-the-art results on two real-world datasets.

2 Related Works

The works [29, 30] were among the first to introduce deep architectures for semantic segmentation. Multiple generations of work [31,32,33,34] have further improved performance, albeit at the cost of more computation and more labelled data. A large corpus of previous literature alleviates the data bottleneck via semi-supervised approaches [10, 35], domain adaptation [9, 36,37,38,39,40], and data augmentation [8, 41].

Works such as [42] alleviate the label bottleneck by creating pseudo-labels for a large set of unlabelled images. This pseudo-labelling utilizes the semantic knowledge in the labelled dataset. In contrast, works such as [23, 24, 43] propose label propagation, where geometric approaches such as video prediction [44] are used for generating labels. In such works, geometric knowledge is used for generating pseudo-labels. As far as we know, our approach is the first to incorporate both semantic and geometric understanding to propagate labels.

The integration of semantic and geometric understanding has previously been investigated for other tasks [45,46,47]. Our work differs in that it targets accurate propagation of labels, which requires a novel loss function based on the cycle consistency of labels. The idea of cyclic consistency has previously been used for learning object embeddings [48, 49] and for video interpolation [50]. Our work is inspired by [48], where cyclic consistency is used for learning object embeddings using a robust tracker. However, unlike [48], our approach addresses the noisy nature of the tracking/geometric modeling method itself.

Previous works deal with the noise in propagated labels by defining heuristics such as a trust factor [24] or label relaxation [22]. Our approach is instead principled: it models the uncertainty associated with the labels. Recently, [28, 51,52,53] have explored the modelling of uncertainty in the context of deep learning models. Furthermore, works such as [54, 55] have explored the relation between aleatoric uncertainty and label noise. In [54], the authors propose a sophisticated noise model for dealing with noisy labelling in the task of keypoint matching. While we also utilize aleatoric uncertainty for dealing with label noise, our approach is formulated for handling multiple noisy label distributions at the same time with minimal computational overhead, unlike [54].

3 Methodology

Section 3.1 defines our warping module, Sect. 3.2 describes the label refinement module, and Sect. 3.3 describes our noisy label learning approach.

Problem Formulation: Take $D_{I}$ to represent an image set and the corresponding labels $D_{L}$. We also have a set of images $D_{Iseq}$ consisting of images temporally adjacent to images in $D_{I}$. Our goal is to generate the corresponding set of pseudo-labels, ${\hat{D}}_{Lseq}$. Utilizing our new pseudo-labels, we can train the network on $D_{I}$ and $D_{Iseq}$, using corresponding labels from $D_{L}$ and ${\hat{D}}_{Lseq}$.

3.1 Label Warping

The goal of this step is to generate warped labels $ \{{\hat{L}}^{w}_{n} | t\le n \le t + p\}$, where $p \in {\mathbb {N}}$ is a fixed integer, given the sequence of images $\{I_{n}| t\le n \le t + p\}$, and the annotated label $L_{t}$. In this step, we utilize an existing method [22] for generating the warped labels.

The neural network $f_{\theta }$ is trained such that, for the set of images $I_{t:t+x} = \{I_{n}| t\le n \le t + x\}$, it predicts the parameters of the sampling and warping function $ \Phi $ used to construct $I_{t+x}$ from $I_{t + x -1}$. The pseudo-labels corresponding to $I_{t+x}$ can similarly be constructed by applying the same function $\Phi $ to ${\hat{L}}^{w}_{t+x-1}$. For the sake of simplicity, we rewrite $I_{t:t+x}$ as the image pair $(I_{t+x-1},I_{t+x})$. This step can be summarized as:

$$\begin{aligned} \Phi _{(t+x-1, t+x)} = f_\theta (I_{t+x-1}, I_{t+x}), \end{aligned}$$

(1)

$$\begin{aligned} {\hat{I}}^{w}_{t+x}&= \Phi _{(t+x-1, t+x)} ( {\hat{I}}^{w}_{t+x-1}), \\ {\hat{L}}^{w}_{t+x}&= \Phi _{(t+x-1, t+x)} ( {\hat{L}}^{w}_{t+x-1}), \end{aligned}$$

(2)

where we use ${\hat{L}}^{w}_t = L_t$ and ${\hat{I}}^{w}_t = I_t$. By applying this method sequentially for $1 \le x \le p$, we can generate the warped labels ${\hat{L}}^{w}_{n}$ for all $t \le n \le t + p$. This method exploits geometric information by predicting motion vectors for each pixel. A more detailed explanation can be found in [22, 44]. This method is prone to generating noisy labels due to errors in the estimation of the warping function $\Phi $.
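The sequential warping of Eq. (2) can be illustrated with a minimal NumPy sketch. It assumes, purely for illustration, that each $\Phi $ is represented as a dense backward flow field with nearest-neighbour sampling; the method of [22, 44] predicts its own sampling and warping parameters, so the names `warp_nearest`, `propagate_labels`, and the toy flow below are hypothetical stand-ins rather than the authors' implementation.

```python
import numpy as np

def warp_nearest(src, flow):
    """Backward-warp `src` (H, W) with a dense flow field (H, W, 2).

    flow[y, x] = (dy, dx): output pixel (y, x) is read from
    src[y + dy, x + dx], rounded to the nearest pixel and clamped to the
    image border. Nearest-neighbour sampling keeps class ids valid.
    """
    h, w = src.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return src[src_y, src_x]

def propagate_labels(label_t, flows):
    """Apply Eq. (2) step by step, starting from the annotated label L_t."""
    warped = [label_t]
    for flow in flows:
        warped.append(warp_nearest(warped[-1], flow))
    return warped

label_t = np.array([[0, 1, 2, 3],
                    [0, 1, 2, 3]])
# Hypothetical flow: content moves one pixel to the right per frame,
# so every output pixel reads from its left neighbour (dx = -1).
shift_right = np.zeros((2, 4, 2))
shift_right[..., 1] = -1
labels = propagate_labels(label_t, [shift_right, shift_right])
# labels[1] is [[0, 0, 1, 2], [0, 0, 1, 2]]; labels[2] is [[0, 0, 0, 1], [0, 0, 0, 1]]
```

Even in this toy setting, class 3 disappears after one step and class 2 after two, which mirrors how warping errors accumulate with propagation length.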

Mask Inpainting: We first introduce a simple post-processing step (named $\texttt{MI}$) to enhance the warped labels. The goal is to identify regions where warping has failed and to replace the labels in those regions using another approximation strategy. We measure the distance between the warped image ${\hat{I}}$ and the real image I at each pixel (x, y); where it exceeds a threshold $\tau $, we replace the warped label with the prediction of a semantic segmentation network $S_\theta $ trained using $D_I$ and $D_L$. This is summarized as:

$$\begin{aligned} \begin{aligned} M(x, y)&= {\mathbb {I}}_{M}\big (d({\hat{I}}(x,y),I(x,y) ) \le \tau \big ) \\ {\hat{L}}^{MI}&= {\hat{L}}^{w} \odot M + (1-M) \odot {\hat{L}}^{pred} \end{aligned} \end{aligned}$$

(3)

where d is a distance measure, $\tau \in {\mathbb {R}}$ is a fixed threshold, ${\mathbb {I}}_{M}$ is the indicator function, ${\hat{L}}^{pred}$ represents the predicted labels from $S_\theta $, and $\odot $ is the element-wise multiplication operator. While the labels are significantly improved after post-processing with $\texttt{MI}$, they still contain noise, which grows as we propagate further. Additionally, the inpainted labels can frequently be wrong for classes where the segmentation network $S_\theta $ fails. For the sake of simplicity, we combine steps (2) and (3), and represent warping followed by post-processing with $\texttt{MI}$ as $\Phi ^{MI}$.
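A minimal NumPy sketch of the mask inpainting step in Eq. (3) follows. The distance d is not specified in this section, so the sketch assumes the mean absolute difference over colour channels; the function name `mask_inpaint` and the toy arrays are illustrative only.

```python
import numpy as np

def mask_inpaint(warped_labels, warped_image, image, pred_labels, tau):
    """Eq. (3): keep warped labels where the warped image matches the real
    image (distance <= tau), otherwise fall back to the network prediction.
    """
    # Assumed distance d: mean absolute difference over colour channels.
    dist = np.abs(warped_image.astype(float) - image.astype(float)).mean(axis=-1)
    mask = dist <= tau                      # M(x, y), True where warping succeeded
    # For integer label maps, L^w * M + (1 - M) * L^pred equals a per-pixel select.
    return np.where(mask, warped_labels, pred_labels), mask

image = np.zeros((2, 2, 3))
warped_image = image.copy()
warped_image[0, 1] = 0.9                    # warping failed at pixel (0, 1)
warped_labels = np.array([[1, 1], [2, 2]])
pred_labels = np.array([[5, 5], [5, 5]])    # stand-in for the output of S_theta

inpainted, mask = mask_inpaint(warped_labels, warped_image, image, pred_labels, tau=0.1)
# inpainted is [[1, 5], [2, 2]]: only the failed pixel takes the predicted label
```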

3.2 Label Refinement

The warped, post-processed labels ${\hat{L}}^{MI}$ are still noisy and need to be refined. We do this by training a label refinement network ($\texttt {LRN}_\theta $), parameterized by $\theta $, which takes the pseudo-labels, the warped images, and the real images as input and predicts the refined labels ${\hat{L}}^{R}$:

$$\begin{aligned} {\hat{L}}^{R} = \texttt {LRN}_{\theta }({\hat{L}}^{MI}, {\hat{I}}^{w}, I) \end{aligned}$$

(4)

$\texttt {LRN}_\theta $ can be viewed as a denoising network, which takes the noisy samples ${\hat{L}}^{MI}$ and tries to predict the clean samples L. To train any denoising network, we typically need noisy-clean sample pairs. However, for ${t < n \le t + p}$ we do not have clean labels $L_{n}$ (as we do not have $D_{Lseq}$). Due to this, we do not have the noisy-clean samples $({\hat{L}}^{MI}, L)$ for training our refinement module. Hence, while using $\texttt{LRN}_\theta $ is fairly simple, training it is non-trivial.

Cycle Consistency of Labels: With an ideal propagation mechanism, when a label $L_{t}$ is propagated through a cyclic loop in time (say t to $t+1$ and then back to t), the resulting cyclic propagated labels (denoted by ${\hat{L}}^{\circ }_{t}$) should be consistent with the initial labels $L_{t}$. Therefore, the inconsistency between ${\hat{L}}^{\circ }_{t}$ and $L_t$ reveals the modes of failure of the propagation mechanism. We utilize this inconsistency as the supervisory signal to train $\texttt{LRN}_\theta $. First, we define the cyclic propagated labels ${\hat{L}}^{\circ }$ for a simple cyclic loop in time:

$$\begin{aligned}&\Phi _{(t, t+1)} = f_\theta (I_t, I_{t+1}), \end{aligned}$$

(5)

$$\begin{aligned}&{\hat{I}}^{w}_{t+1} = \Phi _{(t, t+1)} (I_{t}) \quad ; \quad {\hat{L}}^{MI}_{t+1} = \Phi ^{MI}_{(t, t+1)} ( {\hat{L}}^{w}_{t}) \end{aligned}$$

(6)

$$\begin{aligned}&\Phi _{(t+1, t)} = f_\theta ({\hat{I}}^{w}_{t+1}, I_{t}), \end{aligned}$$

(7)

$$\begin{aligned}&{\hat{I}}^{\circ }_{t} = \Phi _{(t+1, t)} ({\hat{I}}^{w}_{t+1}) \quad ; \quad {\hat{L}}_{t}^{\circ } = \Phi ^{MI}_{(t+1, t)} ( {\hat{L}}^{MI}_{t+1}) \end{aligned}$$

(8)

These cyclic warped labels ${\hat{L}}_{t}^{\circ }$ contain artifacts created by the warping process, which are exacerbated by the repeated application of the warping function $\Phi ^{MI}$. Motivated by the concept of cycle consistency of labels, we utilize the pairs $( {\hat{L}}_{t}^{\circ }, L_{t})$ as the noisy-clean samples for training $\texttt{LRN}_\theta $:

$$\begin{aligned} {\hat{L}}^{R} = \texttt {LRN}_\theta ({\hat{L}}^{\circ }, {\hat{I}}^{\circ }, I) \quad ; \quad \theta ^{*} = \underset{\theta }{\textrm{argmin}}\;{ {\mathbb {E}}({\mathcal {L}}({\hat{L}}^{R}, L))}, \end{aligned}$$

(9)

where ${\mathcal {L}}$ can be any standard loss function such as cross-entropy. In the process of learning to improve consistency between the cyclic warped labels ${\hat{L}}^{\circ }$ and L, $\texttt {LRN}_\theta $ also learns to refine single warped labels ${\hat{L}}^{MI}$. It is important to note that cyclic labels capture the noise of the warping process only because $\Phi _{(t+1, t)} \ne \Phi ^{-1}_{(t, t+1)}$.
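The construction of one noisy-clean pair from Eqs. (5) to (8) can be sketched as follows. The warp predictor and warp operator are passed in as callables; the toy versions used here (an integer column shift with edge replication, and a brute-force shift search) are hypothetical stand-ins for $f_\theta $ and $\Phi ^{MI}$, and the mask inpainting part of $\Phi ^{MI}$ is omitted for brevity.

```python
import numpy as np

def cyclic_pair(label_t, image_t, image_t1, predict_warp, apply_warp):
    """Build one training pair ((L°_t, I°_t, I_t), L_t) following Eqs. (5)-(8).

    predict_warp(src_img, dst_img) -> warp parameters (stand-in for f_theta)
    apply_warp(params, array)      -> warped array    (stand-in for Phi)
    """
    fwd = predict_warp(image_t, image_t1)      # Eq. (5)
    image_w = apply_warp(fwd, image_t)         # Eq. (6), image
    label_w = apply_warp(fwd, label_t)         # Eq. (6), label
    bwd = predict_warp(image_w, image_t)       # Eq. (7)
    image_cyc = apply_warp(bwd, image_w)       # Eq. (8), image
    label_cyc = apply_warp(bwd, label_w)       # Eq. (8), label
    return (label_cyc, image_cyc, image_t), label_t

def shift_cols(arr, k):
    """Toy warp: shift columns by k with edge replication (border content is lost)."""
    w = arr.shape[1]
    return arr[:, np.clip(np.arange(w) - k, 0, w - 1)]

def toy_predict_warp(src, dst):
    """Toy predictor: the integer shift in {-1, 0, 1} that best maps src onto dst."""
    return min((-1, 0, 1), key=lambda k: np.abs(shift_cols(src, k) - dst).sum())

image_t = np.array([[10.0, 20.0, 30.0, 40.0]])
image_t1 = shift_cols(image_t, 1)              # next frame: content moved right
label_t = np.array([[0, 1, 2, 3]])

(label_cyc, image_cyc, _), clean = cyclic_pair(
    label_t, image_t, image_t1, toy_predict_warp, lambda k, a: shift_cols(a, k))
# label_cyc is [[0, 1, 2, 2]]: the round trip does not restore the right border
```

The mismatch at the border arises because the backward warp is not the exact inverse of the forward warp, which is the kind of artifact the refinement network is trained to undo.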

Equation (8) represents the cyclic labels generated from a single forward and backward step. However, it is possible to perform multiple forward and backward steps, generating multiple $L^{\circ }_{t}$ for each $L_t$. This allows us to capture more diverse artifacts created due to $\Phi ^{MI}$. Figure 2 shows multiple cyclic warped samples for a given label $L_t$.

Once the refinement module is trained with the cyclic propagated label pairs, we propagate labels by a) warping, b) post-processing, and then finally c) refining with $\texttt{LRN}_\theta $ to generate refined labels ${\hat{L}}^R_{t+p}$ [refer Eq. (4)]. This is the complete pipeline of Warp-Refine Propagation. At each time step $t+p$, the labels ${\hat{L}}^R_{t+p}$ undergo propagation to generate ${\hat{L}}^R_{t+p+1}$.

3.3 Uncertainty Aware Training

As shown in Fig. 2, the Warp-Refine Propagation pipeline generates pseudo-labels of high quality. While these are directly beneficial for training, we note that using labels propagated over a large temporal distance (say $t+10$) can lead to a drop in performance. This is due to the inherent noise in the pseudo-labels. Handling this noise can be significantly advantageous, as pseudo-labels further away from the annotated frame contain novel information.

Formally, let us denote labels generated by a given noisy data-generation process $\delta $ as samples of the distribution $P_{\delta }(y|x)$, whereas the underlying label distribution is P(y|x). Typically, using a model $M_{\theta }(y|x)$ parameterized by $\theta $, we estimate the underlying label distribution P(y|x) by maximizing the log-likelihood of the model over the given data:

$$\begin{aligned} \begin{aligned} \theta ^{*} =\ {}&\underset{\theta }{\textrm{argmax}} \big [ {\mathbb {E}}_{y\sim P(y|x)} \log M_{\theta }(y|x) \big ] \\&\sim \underset{\theta }{\textrm{argmax}} \sum _{i \in D}{\log M_\theta (y_i|x_i)} \end{aligned} \end{aligned}$$

(10)

where $\theta ^{*}$ represents the optimal parameters for $M_\theta $. Since our data contain noisy samples from $P_{\delta }$, our model is biased to model $P_{\delta }$ rather than P. To address the distributional shift between $P_\delta $ and P, we modify the objective of our optimization. Let us consider the relation between noisy labels and clean labels as $P(y_\delta |x,y)$, where $y_\delta $ represents the noisy sample for a given x. Since we want $M_\theta $ to model the underlying label distribution P, we can rewrite our estimate $P_\theta $ for noisy labels $y_\delta $, and the corresponding objective, as:

$$\begin{aligned} P(y_\delta |x) = \sum _{y'}P(y_\delta | x, y')P(y=y'|x) \quad ; \quad P_\theta (y_\delta |x) = \sum _{y'}P(y_\delta | x, y')M_\theta (y=y'|x) \end{aligned}$$

(11)

$$\begin{aligned} \theta ^{*} \sim \underset{\theta }{\textrm{argmax}} \big [ \sum _{y_\delta \sim P_\delta }{\log P_\theta (y_\delta | x)} \big ] \end{aligned}$$

(12)
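As a concrete illustration of Eqs. (11) and (12), the sketch below pushes a clean class distribution through a label-noise transition matrix and evaluates the negative log-likelihood of observed noisy labels. It assumes, for simplicity, a transition matrix that does not depend on x and symmetric noise with rate 0.2; both are assumptions of this example, not choices stated in the text.

```python
import numpy as np

def noisy_label_probs(clean_probs, transition):
    """Eq. (11): P_theta(y_delta | x) = sum_{y'} P(y_delta | y') M_theta(y' | x).

    clean_probs: (N, K) model distribution over clean classes.
    transition:  (K, K), transition[y', k] = P(y_delta = k | y = y').
    """
    return clean_probs @ transition

def noisy_nll(clean_probs, transition, noisy_labels):
    """Negative of the objective in Eq. (12), averaged over the N samples."""
    probs = noisy_label_probs(clean_probs, transition)
    picked = probs[np.arange(len(noisy_labels)), noisy_labels]
    return float(-np.log(picked).mean())

K, eps = 3, 0.2
# Assumed symmetric noise: keep the label with prob. 1 - eps, else flip uniformly.
transition = (1 - eps) * np.eye(K) + (eps / (K - 1)) * (1 - np.eye(K))

clean_probs = np.array([[1.0, 0.0, 0.0],
                        [0.5, 0.5, 0.0]])
noisy_probs = noisy_label_probs(clean_probs, transition)
# noisy_probs is [[0.8, 0.1, 0.1], [0.45, 0.45, 0.1]]
loss = noisy_nll(clean_probs, transition, np.array([0, 1]))
```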

Theorem 1

Let $\epsilon = 1-\min _{y'}P(y_\delta =y' | x, y')$. If $\epsilon < 0.5$, then the following inequality holds for the distributions $P(y_\delta |x)$ and $P_\theta (y_\delta |x) $ defined in Eq. (11):

$$\begin{aligned} d_{TV}(P(y|x), M_\theta (y|x)) \le \frac{1}{1-2\epsilon } \left( \sqrt{2KL[P(y_\delta |x)|P_\theta (y_\delta |x)]} + \gamma \right) , \end{aligned}$$

(13), (14)

where $d_{TV}(p(y),q(y))$ is the total variation distance and KL[p(y)|q(y)] is the Kullback–Leibler (KL) divergence.

Therefore, our objective (12), which minimizes the KL divergence between $P(y_\delta |x)$ and $P_\theta (y_\delta |x)$, also lowers the upper bound on the total variation distance between P(y|x) and $M_\theta (y|x)$ (Fig. 3). The proof is provided in the supplementary material.
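The bound of Theorem 1 can be checked numerically in a simple special case. The sketch below assumes two classes and symmetric label noise with rate $\epsilon = 0.2$, and sets the term $\gamma $ to zero; in this special case the total variation distance shrinks by exactly $1-2\epsilon $ under the noise, and Pinsker's inequality makes the KL term sufficient on its own. The distributions `p` and `m` are arbitrary illustrative values.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())

def kl(p, q):
    """KL divergence KL[p | q] in nats (all entries assumed positive)."""
    return float(np.sum(p * np.log(p / q)))

eps = 0.2
# Symmetric binary noise: T[y', k] = P(y_delta = k | y = y'); eps = 1 - min diagonal.
T = np.array([[1 - eps, eps],
              [eps, 1 - eps]])

p = np.array([0.9, 0.1])   # underlying P(y | x), illustrative values
m = np.array([0.6, 0.4])   # model M_theta(y | x), illustrative values
p_noisy, m_noisy = p @ T, m @ T          # Eq. (11) for P and for P_theta

lhs = tv(p, m)                                          # left side of the bound
rhs = np.sqrt(2 * kl(p_noisy, m_noisy)) / (1 - 2 * eps)  # right side with gamma = 0
```

Here `lhs` is 0.3 and stays below `rhs`, so driving the KL divergence between the noisy distributions towards zero forces the clean distributions together as well.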

Note that our formulation is independent of the labelling process $\delta $, and hence can be used for multiple labelling processes $\delta _j; j \in {\mathbb {N}}$. Now, we model $P_\delta $ as a noisy version of P. Taking inspiration from [28], we represent $P_\theta (y_{\delta _j} | x)$ as:

$$\begin{aligned} P_\theta (y_{\delta _j} = k | x)&= E_{a^{i}_{j,k} \sim {\mathcal {N}}(\mu ^i_k(x), \sigma _{\delta _j}^i(x))}[{{\,\textrm{Softmax}\,}}(a^{i}_{j,k})] \end{aligned}$$

(15)

where ${{\,\textrm{Softmax}\,}}(a_k) = \exp (a_k)/\sum _{k'}\exp (a_{k'})$ is the softmax function. We adapt the model parameter $\theta $ to model the statistics $(\mu ^i(x), \sigma _{\delta _0}^i(x),\sigma _{\delta _1}^i(x),.. )$ for each pixel i in a given image x. The optimization objective can now be written as:

$$\begin{aligned} \theta ^{*} = \underset{\theta }{\textrm{argmax}} \Bigg [ \sum _{j}{\sum _{i \in D_{{\delta _j}}}{\log P_\theta (y_{\delta _j} | x)}} \Bigg ] \end{aligned}$$

(16)

Since $ P_\theta (y_{\delta _j} | x)$ has no analytical solution, we approximate the objective by Monte Carlo integration, as described in [28]. As ${{\,\textrm{Softmax}\,}}(\mu ^i)$ models the underlying distribution P(y|x), at test time our prediction for a given image is simply $\mu ^i$ for each pixel i in the image.

In practice, this objective translates to modelling separate aleatoric uncertainty components for each of the noisy labelling processes $\delta _j$. The aleatoric uncertainty is modelled by adding a small two-layer head to the given segmentation models to predict $\sigma _{\delta _j}$.
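The Monte Carlo approximation of Eq. (15) and the per-process objective of Eq. (16) can be sketched in NumPy as follows. The sketch assumes that $\sigma _{\delta _j}^i(x)$ is a per-pixel standard deviation shared across classes and that logits are perturbed independently per class; the function names, the number of samples, and the toy values are illustrative assumptions, and a real implementation would compute this inside the training framework so that gradients reach $\mu $ and $\sigma $.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def mc_noisy_probs(mu, sigma, n_samples=64, rng=None):
    """Monte Carlo estimate of Eq. (15) for one labelling process delta_j.

    mu:    (P, K) per-pixel logits mu^i_k(x).
    sigma: (P,)   per-pixel noise scale for this process (assumed std. dev.).
    Returns a (P, K) estimate of P_theta(y_{delta_j} = k | x).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((n_samples,) + mu.shape)
    logits = mu[None] + sigma[None, :, None] * noise   # samples a ~ N(mu, sigma^2)
    return softmax(logits).mean(axis=0)

def uat_nll(mu, sigmas, labels, j, n_samples=64, rng=None):
    """Negative log-likelihood of one image labelled by process j (one term of Eq. (16)).

    sigmas: (J, P), one predicted noise map per labelling process.
    """
    probs = mc_noisy_probs(mu, sigmas[j], n_samples, rng)
    return float(-np.log(probs[np.arange(len(labels)), labels]).sum())

mu = np.array([[2.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])                # 2 pixels, 3 classes
sigmas = np.array([[0.0, 0.0],                  # process 0: treated as noise-free
                   [1.0, 2.0]])                 # process 1: noisy labels
labels = np.array([0, 1])

clean_loss = uat_nll(mu, sigmas, labels, j=0)
noisy_probs = mc_noisy_probs(mu, sigmas[1], rng=np.random.default_rng(0))
```

With a zero noise scale the estimate reduces to the ordinary softmax cross-entropy, which is why the same network output $\mu $ can be used directly at test time.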

Using Pseudo Labels: We also utilize pseudo-semantic labelling [18, 19] for data points where label propagation is not possible. For Cityscapes [14], we use predictions from a model $S_\theta $ trained utilizing only the labels $D_L$, to create dataset $D_{{\hat{L}}ps}$ (specifically, labels for the train_extra subset of Cityscapes). As our training procedure is agnostic to the label generation process, we can use the different labels by simply modelling a separate uncertainty parameter $\sigma _{\delta _j}$ for each of them.

Table 1 Benefits of training with our pseudo-labels


4 Experiments

4.1 Implementation Details

We train our label refinement network ($\texttt{LRN}_\theta $) with the dual-task loss [56].

For training semantic segmentation networks, we adopt the methodology of [22] as our baseline, with one departure: we do not use the Relaxed Label Loss proposed in [22], for reasons explained in Sect. 4.3. We train all our models for 220 epochs.

Our network architecture is based on the DeepLabV3+ model [57]. For ablations, we use a ResNeXt [58] backbone. For submissions on the test set, we use a WideResNet38 [59] backbone instead; this switch is made to improve performance specifically for test set submissions.

Dataset Details: The ApolloScape dataset [15] comprises 143,906 annotated images. To keep our study manageable, we create two subsets: a training subset containing 40,100 images and a validation subset containing 6113 images. Within the training subset, we further establish two partitions, $D_I$ and $D_{Iseq}$. $D_I$ is composed of 2005 images, while $D_{Iseq}$ comprises 40,100 images. This division is carried out by creating continuous sequences of 21 frames each. Importantly, ApolloScape provides annotations for all frames. This gives us the ground truth label set $D_{Lseq}$, which serves as a clean reference for comparison. We evaluate models trained with both "clean" annotations (those present in the original dataset) and "propagated" annotations (generated through our techniques). The evaluation is conducted on the validation subset, which is kept separate throughout training.

The Cityscapes dataset [14] contains 5000 annotated images, divided into a training set $D_I$ with 2975 images, a validation set with 500 images, and a test set with 1525 images. Alongside $D_I$, we use two further image sets, $D_{Iseq}$ and $D_{Ips}$. $D_{Iseq}$ consists of sequential images without annotations, while $D_{Ips}$ comes with coarse annotations. We discard these coarse labels and instead use pseudo-labelling to create the labels $D_{{\hat{L}}ps}$ for this subset.

4.2 Evaluating Label Propagation

We quantitatively show that our method propagates labels with significantly less noise than comparable techniques. Using the clean annotations $D_{Lseq}$ in ApolloScape, we evaluate the different propagation methods: from the manually annotated labels in $D_L$, we generate the approximate labels $D_{{\hat{L}}seq}$ with each propagation technique and evaluate them against the given annotated labels $D_{Lseq}$.

Figure 5 shows the mean Intersection over Union (mIoU) of different propagation methods at each propagation length. This is evaluated on the sequences adjacent to the training set itself, as label propagation is also conducted on those sequences only. We compare against a comparable method [22] and against predictions from a segmentation model $\texttt{S}_\theta $ trained on $D_L$ only. Our label propagation method surpasses the others and, as shown, [22] quickly starts performing even worse than $\texttt{S}_\theta $.
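For reference, the mIoU between a propagated label map and its ground truth can be computed as in the sketch below. This is a simplified per-image version; benchmark evaluations usually accumulate intersections and unions over the whole dataset before averaging. The `ignore_index` value of 255 and the function name `mean_iou` are assumptions of this example.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection over Union between two integer label maps.

    Pixels whose target equals `ignore_index` are excluded; classes that
    appear in neither map are skipped instead of counting as IoU 0.
    """
    valid = target != ignore_index
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        t = (target == c) & valid
        union = (p | t).sum()
        if union == 0:
            continue
        ious.append((p & t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

target = np.array([[0, 0],
                   [1, 1]])       # ground truth labels, e.g. from D_Lseq
pred = np.array([[0, 1],
                 [1, 1]])         # propagated labels for the same frame
miou = mean_iou(pred, target, num_classes=2)
# class 0: IoU 1/2, class 1: IoU 2/3, so miou is 7/12
```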

4.3 Learning with Generated Labels

We evaluate the improvement in semantic segmentation obtained by adding generated labels to training, reporting the mean of each metric over three different runs.

First, we present results countering the claims of the previous work [22]. Specifically, we find the method of [22] to be ineffective in improving semantic segmentation. The baseline reported in [22] is trained for one-third of the iterations with a suboptimal learning rate (Table 2). By equalizing the number of training iterations and increasing the learning rate, we find that the baseline is able to match the proposed models from [22].

Table 2 State-of-the-art methods on the Cityscapes dataset (test partition)


Table 3 We show the benefit of training with our pseudo-labels


To benefit from propagated data, we find it essential to include propagated samples from multiple timesteps. Following [24], for each label in $D_L$, we include propagated labels at timesteps $t \pm p$ where $p \in \{2,4,6,8\}$ from $D_{{\hat{L}}seq}$. Furthermore, we include pseudo-labelling on the train_extra subset $D_{{\hat{L}}ps}$ as well. Table 3 displays the results. The propagated labels, when used with Uncertainty-Aware Training (UAT), boost performance by 0.49 mIoU. Performance is further boosted by 0.47 mIoU when $D_{{\hat{L}}ps}$ is used as well. Note that UAT is critical for increasing performance in the presence of propagated labels.

Finally, using the same method, we improve upon state-of-the-art results on Cityscapes (when training with fine labels only). Table 3 shows the improvement obtained by training with our method. We observe that in the presence of coarse labels and Mapillary Vistas pretraining [66], the benefits of label propagation are not clear. This is expected, as label propagation cannot be performed for the coarse labels or the Mapillary Vistas dataset; hence, in the presence of those labels, propagation is performed for only $\sim $ 10% of the entire dataset.

Evaluation on ApolloScape: In Table 1 we show the benefit of training with propagated labels and the uncertainty-aware training regime on the ApolloScape dataset.

Modelling Label Noise with Uncertainty: We demonstrate that the uncertainty estimates are able to model the noise in the data-generation process.

We rigorously evaluate the impact of incorporating generated labels on semantic segmentation performance. Contrary to a previous study [22], we demonstrate that their baseline approach falls short due to inadequate training parameters. By aligning training iterations and improving learning rates, we establish parity with their proposed models. Leveraging propagated labels from various time steps and applying Uncertainty-Aware Training (UAT) leads to a noteworthy 0.49 mIoU enhancement. Extending this with pseudo-labelling on the $D_{{\hat{L}}ps}$ subset further boosts performance by 0.47 mIoU. The pivotal role of UAT in improving performance is evident, especially in the presence of propagated labels. This strategy also outperforms state-of-the-art results on the Cityscapes dataset and yields similar gains on the ApolloScape dataset, as detailed in Tables 1 and 3. Furthermore, we showcase how uncertainty estimates effectively model noise in data generation through precision-recall curves in Fig. 4, underscoring the accuracy of our approach.

5 Conclusion

Our work addresses two questions: how to propagate labels, and how to train with noisy pseudo-labels. Our propagation method utilizes the concept of cycle consistency of labels to significantly improve label propagation. Further, our noisy label learning approach effectively utilizes uncertainty and mitigates the drawbacks encountered when training with noisy labels. Our techniques achieve improved results on Cityscapes. We conclude that the noisy label learning approach opens the door to utilizing further noisy augmentation processes, such as image-based rendering methods, and we hope to explore this direction in future work.

6 Potential Broader Impact

This research can benefit companies or institutions that need applications for semantic understanding of video data, such as autonomous driving. If autonomous driving becomes a ubiquitous reality, people currently employed as drivers could be disadvantaged by the results of this research, although indirectly. If an autonomous driving system fails because of the proposed component, the consequence could be an accident or the loss of human life. We believe our proposed method leverages some bias in the data, as the distribution of the training set will influence the representations learned by a downstream semantic image segmentation network.

References

Mani V (2023) Deep learning models for semantic multi-modal medical image segmentation. In: Research anthology on improving medical imaging techniques for analysis and intervention. IGI Global, pp. 107–125
Qin Z, Lu X, Nie X, Liu D, Yin Y, Wang W (2023) Coarse-to-fine video instance segmentation with factorized conditional appearance flows. IEEE/CAA J Autom Sinica 10(5):1192–1208
Das A, Xian Y, He Y, Akata Z, Schiele B (2023) Urban scene semantic segmentation with low-cost coarse annotation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 5978–5987
Laraqui A, Azmi K, Laraqui M, Boussedra F (2023) Stitched image based on a real-time video conversion technique. Int J Comput Aid Eng Technol 18(1–3):141–151
Touvron H, Vedaldi A, Douze M, Jégou H (2019) Fixing the train-test resolution discrepancy. In: Advances in neural information processing systems (NeurIPS)
Zhang H, Wu C, Zhang Z, Zhu Y, Zhang Z, Lin H, Sun Y, He T, Muller J, Manmatha R, Li M, Smola A (2020) Resnest: split-attention networks. arXiv preprint arXiv:2004.08955
Kolesnikov A, Beyer L, Zhai X, Puigcerver J, Yung J, Gelly S, Houlsby N (2019) Big transfer (bit): general visual representation learning
Jeong J, Lee S, Kim J, Kwak N (2019) Consistency-based semi-supervised learning for object detection. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, (Eds). Advances in neural information processing systems 32. Curran Associates, Inc., pp. 10759–10768
Zhao S, Li B, Yue X, Gu Y, Xu P, Hu R, Chai H, Keutzer K (2019) Multi-source domain adaptation for semantic segmentation. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, (Eds). Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp. 7287–7300
Zhou Y, He X, Huang L, Liu L, Zhu F, Cui S, Shao L (2019) Collaborative learning of semi-supervised segmentation and classification for medical images. In: The IEEE conference on computer vision and pattern recognition (CVPR), June
Zhan X, Liu Z, Luo P, Tang X, Loy CC (2018) Mix-and-match tuning for self-supervised semantic segmentation. In: AAAI Conference on Artificial Intelligence (AAAI), February
Larsson M, Stenborg E, Toft C, Hammarstrand L, Sattler T, Kahl F (2019) Fine-grained segmentation networks: Self-supervised segmentation for improved long-term visual localization. In: The IEEE international conference on computer vision (ICCV), October
Benenson R, Popov S, Ferrari V (2019) Large-scale interactive object segmentation with human annotators. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11700–11709
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Wang P, Huang X, Cheng X, Zhou D, Geng Q, Yang R (2019) The apolloscape open dataset for autonomous driving and its application. In: IEEE Transactions on pattern analysis and machine intelligence
Chang MF, Lambert JW, Sangkloy P, Singh J, Bak S, Hartnett A, Wang D, Carr P, Lucey S, Ramanan D, Hays J (2019) Argoverse: 3d tracking and forecasting with rich maps. In: Conference on computer vision and pattern recognition (CVPR)
Ke R, Bugeau A, Papadakis N, Schuetz P, Schönlieb C-B (2019) A multi-task u-net for segmentation with lazy labels
Zamir AR, Sax A, Shen WB, Guibas LJ, Malik J, Savarese S (2018) Taskonomy: disentangling task transfer learning. In: IEEE conference on computer vision and pattern recognition (CVPR). IEEE
Zhang Q, Zhang J, Liu W, Tao D (2019) Category anchor-guided unsupervised domain adaptation for semantic segmentation. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, (Eds). Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp. 435–445
Zhu X, Ghahramani Z (2002) Learning from labeled and unlabeled data with label propagation
Badrinarayanan V, Galasso F, Cipolla R (2010) Label propagation in video sequences. In: IEEE Computer society conference computer vision pattern recognition, pp. 3265–3272
Zhu Y, Sapra K, Reda FA, Shih K J, Newsam S, Tao A, Catanzaro B (2019) Improving semantic segmentation via video propagation and label relaxation. In: The IEEE conference on computer vision and pattern recognition (CVPR), June
Budvytis I, Sauer P, Roddick T, Breen K, Cipolla R (2017) Large scale labelled video data augmentation for semantic segmentation in driving scenarios. In: 5th Workshop on computer vision for road scene understanding and autonomous driving in IEEE international conference on computer vision (ICCV), October
(2016) Can ground truth label propagation from video help semantic segmentation? In: Computer vision - ECCV 2016 Workshops, Proceedings, series. Lecture notes in computer science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), G. Hua and H. Jegou, (Eds). Springer, Vol 1, pp. 804–820
Zhang Y, Lv H, Zhao Y, Feng Y, Liu H, Bi G (2023) Event-based optical flow estimation with spatio-temporal backpropagation trained spiking neural network. Micromachines 14(1):203
Lu X, Wang W, Shen J, Crandall DJ, Van Gool L (2022) Segmenting objects from relational visual data. IEEE Trans Pattern Anal Mach Intell 44(11):7885–7897
Hao F, Ma ZF, Tian HP, Wang H, Wu D (2023) Semi-supervised label propagation for multi-source remote sensing image change detection. Comput Geosci 170:105249
Kendall A, Gal Y (2017) What uncertainties do we need in bayesian deep learning for computer vision? In: Advances in neural information processing systems 30, Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, (Eds). Curran Associates, Inc., pp. 5574–5584
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: IEEE Conference on computer vision and pattern recognition (CVPR), pp. 3431–3440
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on computer vision and pattern recognition, pp. 580–587
Chen L, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In: Y. Bengio and Y. LeCun (Eds), 3rd International conference on learning representations Conference Track Proceeding, ICLR 2015, San Diego, CA, USA, May 7–9
Chen L, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. CoRR, arXiv:1706.05587
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: CVPR
Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF (eds) Medical image computing and computer-assisted intervention - MICCAI 2015. Springer International Publishing, Cham, pp 234–241
Lee J, Kim E, Lee S, Lee J, Yoon S (2019) Ficklenet: weakly and semi-supervised semantic image segmentation using stochastic inference. In: The IEEE conference on computer vision and pattern recognition (CVPR), June
Vu T-H, Jain H, Bucher M, Cord M, Perez P (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In: The IEEE conference on computer vision and pattern recognition (CVPR), June
Li Y, Yuan L, Vasconcelos N (2019) Bidirectional learning for domain adaptation of semantic segmentation. In: The IEEE conference on computer vision and pattern recognition (CVPR), June
Chen YC, Lin YY, Yang MH, Huang JB (2019) Crdoco: pixel-level domain transfer with cross-domain consistency. In: The IEEE conference on computer vision and pattern recognition (CVPR), June
Zhang Q, Zhang J, Liu W, Tao D (2019) Category anchor-guided unsupervised domain adaptation for semantic segmentation. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, (Eds). Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp. 435–445
Zhao S, Li B, Yue X, Gu Y, Xu P, Tan Hu R, Chai H, Keutzer K (2019) Multi-source domain adaptation for semantic segmentation. In: Advances in Neural Information Processing Systems
Berthelot D, Carlini N, Goodfellow I, Papernot N, Oliver A, Raffel C A (2019) Mixmatch: a holistic approach to semi-supervised learning. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, (Eds). Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp. 5049–5059
Xie Q, Hovy E, Luong MT, Le QV (2019) Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252
Badrinarayanan V, Budvytis I, Cipolla R (2013) Semi-supervised video segmentation using tree structured graphical models. IEEE Trans Pattern Anal Mach Intell 35(11):2751–2764
Reda F, Liu G, Shih K, Kirby R, Barker J, Tarjan D, Tao A, Catanzaro B (2018) SDC-Net: video prediction using spatially-displaced convolution. In: Computer vision - ECCV 2018: 15th European conference, Munich, Germany, September 8–14, 2018, Proceedings, Part VII, pp. 747–763
Luc P, Couprie C, LeCun Y, Verbeek J (2018) Predicting future instance segmentation by forecasting convolutional features. In: The European Conference on Computer Vision (ECCV), September
Gadde R, Jampani V, Gehler PV (2017) Semantic video CNNS through representation warping. In: The IEEE International Conference on Computer Vision (ICCV), Oct
Voigtlaender P, Chai Y, Schroff F, Adam H, Leibe B, Chen LC (2019) Feelvos: fast end-to-end embedding learning for video object segmentation. In: CVPR
Wang X, Jabri A, Efros AA (2019) Learning correspondence from the cycle-consistency of time. In: CVPR
Qin Z, Lu X, Nie X, Zhen X, Yin Y (2021) Learning hierarchical embedding for video instance segmentation. In: Proceedings of the 29th ACM international conference on multimedia, pp. 1884–1892
Reda FA, Sun D, Dundar A, Shoeybi M, Liu G, Shih KJ, Tao A, Kautz J, Catanzaro B (2019) Unsupervised video interpolation using cycle consistency. In: The IEEE International conference on computer vision (ICCV), October
Kendall A, Gal Y, Cipolla R (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
Yehezkel Rohekar R, Gurwicz Y, Nisimov S, Novik G (2019) Modeling uncertainty by learning a hierarchy of deep neural connections. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, (Eds). Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp. 4244–4254
Thulasidasan S, Chennupati G, Bilmes JA, Bhattacharya T, Michalak S (2019) On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R (Eds). Advances in neural information processing systems 32, Curran Associates, Inc., pp. 13888–13899
Neverova N, Novotny D, Vedaldi A (2019) Correlated uncertainty for learning dense correspondences from noisy labels. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, (Eds). Advances in Neural Information Processing Systems 32, Curran Associates, Inc., pp. 920–928
Choi S, Lee K, Lim S, Oh S (2018) Uncertainty-aware learning from demonstration using mixture density networks with sampling-free variance modeling. In: 2018 IEEE International conference on robotics and automation, ICRA 2018, Brisbane, Australia, May 21–25, pp. 6915–6922
Takikawa T, Acuna D, Jampani V, Fidler S (2019) Gated-scnn: gated shape CNNS for semantic segmentation. In: ICCV
Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: The European conference on computer vision (ECCV), September
Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: The IEEE conference on computer vision and pattern recognition (CVPR), July
Wu Z, Shen C, van den Hengel A (2019) Wider or deeper: revisiting the ResNet model for visual recognition. Patt Recognit 90:119–133
Yang M, Yu K, Zhang C, Li Z, Yang K (2018) DenseASPP for semantic segmentation in street scenes. In: IEEE conference on computer vision and pattern recognition (CVPR)
Cheng B, Chen LC, Wei Y, Zhu Y, Huang Z, Xiong J, Huang TS, Hwu WM, Shi H (2019) SPGNet: semantic prediction guidance for scene parsing. In: IEEE International conference on computer vision (ICCV)
Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H (2019) Dual attention network for scene segmentation. In: IEEE conference on computer vision and pattern recognition (CVPR)
Sun K, Zhao Y, Jiang B, Cheng T, Xiao B, Liu D, Mu Y, Wang X, Liu W, Wang J (2019) High-Resolution Representations for Labeling Pixels and Regions, arXiv preprint arXiv:1904.04514
Li X, Zhong Z, Wu J, Yang Y, Lin Z, Liu H (2019) Expectation-maximization attention networks for semantic segmentation. In: IEEE International conference on computer vision (ICCV)
Fu J, Liu J, Wang Y, Li Y, Bao Y, Tang J, Lu H (2019) Adaptive context network for scene parsing. In: IEEE International Conference on Computer Vision (ICCV)
Neuhold G, Ollmann T, Rota Bulò S, Kontschieder P (2017) The mapillary vistas dataset for semantic understanding of street scenes. In: International Conference on Computer Vision (ICCV)

Author information

Authors and Affiliations

Amrita Vishwa Vidyapeetham, Kollam, Kerala, India
Veerababu Addanki, Dhanvanth Reddy Yerramreddy, Sathvik Durgapu & Sasi Sai Nadh Boddu
SASTRA University, Thanjavur, Tamil Nadu, India
Vyshnav Durgapu

Authors

Veerababu Addanki
Dhanvanth Reddy Yerramreddy
Sathvik Durgapu
Sasi Sai Nadh Boddu
Vyshnav Durgapu

Corresponding author

Correspondence to Dhanvanth Reddy Yerramreddy.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Addanki, V., Yerramreddy, D.R., Durgapu, S. et al. Enhancing Semi Supervised Semantic Segmentation Through Cycle-Consistent Label Propagation in Video. Neural Process Lett 56, 4 (2024). https://doi.org/10.1007/s11063-024-11459-6

Accepted: 25 November 2023
Published: 06 February 2024
DOI: https://doi.org/10.1007/s11063-024-11459-6
