1 Introduction

While machine learning has enabled high-performing models for various computer vision tasks [5, 6], large quantities of annotated data are still necessary to achieve state-of-the-art results [7]. This problem is exacerbated for tasks such as semantic image segmentation, where the annotation required for training can be prohibitively expensive. Various semi-supervised [8,9,10] and self-supervised methods [11, 12] have been proposed to alleviate this data bottleneck; however, they bring their own challenges, such as domain shift.

While dense annotation is costly [13, 14], collecting raw sequential image data is relatively cheap. This is the norm for autonomous driving datasets such as [14,15,16]. Such datasets are often sparsely annotated across time (for example, in [14], pixel-level segmentation is provided for only one frame in each continuous sequence of 30 frames). To alleviate the data bottleneck, we can generate approximate labels for the unlabelled samples. While weak labels can be generated via lazy labelling [17] or pseudo-semantic annotations [18, 19], the presence of an annotated sample in a continuous temporal sequence motivates the use of label propagation [20, 21] for generating the approximate labels of neighboring frames.

Fig. 1 (a) Using our propagation method, we generate pseudo-labels for images \(D_{Iseq}\). The generated and the clean labels are then used for uncertainty-aware training of semantic segmentation models. (b) Our propagation method fares better than conventional label propagation techniques [22]

The two primary concerns regarding label propagation are:

a) How to perform it? and

b) How to utilize the generated labels? Our work addresses both, devising a novel technique to improve label propagation, and a principled approach for training with generated noisy labels.

Previous methods [22,23,24], due to their reliance on geometric cues, share several drawbacks of optical flow estimation [25]. Specifically, common characteristics of real-world data, such as the lack of brightness consistency or large motions, introduce harmful noise in the propagated labels. This drawback is aggravated by the accumulation of errors as labels are propagated further in time. This matters because longer propagation can yield more diverse labelled data, which is more beneficial for training deep neural networks [24].

Specifically, we make two important observations regarding previous propagation methods [22, 24]:

1) They do not leverage the semantic knowledge in the annotated dataset, and

2) The noise in labels propagated with geometric methods has a systemic component which can be modelled.

Our strategy, called Warp-Refine Propagation, addresses both of these concerns. The method consists of two steps: a label warping step followed by a label refinement step. At the heart of our approach is the concept of cycle consistency of labels, which we explain in detail in Sect. 3.2. As Fig. 1 shows, our method fares better than conventional label propagation techniques.

Despite these improvements, the propagated labels can still be noisy, especially when labels are propagated over longer time spans [26]. Thus, we formulate Uncertainty-Aware Training, a principled approach for training with noisy labels, in contrast to previous heuristic methods such as label relaxation [22, 27] or loss weighting [24]. We train the model to maximize the likelihood of the noisy labels, expressed in terms of the underlying distribution of the true labels. This allows us to estimate the uncertainty of each label generation process, mitigating the drawbacks of training with noisy labels. Furthermore, our approach can handle multiple noisy label distributions at the same time. Hence, with minimal changes to our model, we are also able to use pseudo-labelling [18, 19], i.e. noisy predictions from a pretrained model, for data points where propagation is not possible. In Sect. 3.3, we draw the link between our approach and the modelling of aleatoric uncertainty [28] for different data generation processes.

Using our propagation method and our noisy label learning approach, we achieve state-of-the-art results on two large-scale datasets, Cityscapes and ApolloScape. Further, we provide a quantitative evaluation of various propagation methods, which has been missing from previous literature.

In summary, we

  • Improve label propagation with a new approach called Warp-Refine Propagation, and provide quantitative as well as qualitative evaluation for the same.

  • Propose Uncertainty-Aware Training, a principled approach utilizing noisy labels, for which we further provide theoretical justification.

  • Use the proposed advances to achieve state-of-the-art results on two real-world datasets.

2 Related Works

The works [29, 30] were among the first to introduce deep architectures for semantic segmentation. Multiple generations of work [31,32,33,34] have further improved performance, albeit at the cost of more computation and more labelled data. A large body of previous literature alleviates the data bottleneck via semi-supervised approaches [10, 35], domain adaptation [9, 36,37,38,39,40], and data augmentation [8, 41].

Works such as [42] alleviate the label bottleneck by creating pseudo-labels for a large set of unlabelled images. This pseudo-labelling utilizes the semantic knowledge in the labelled dataset. In contrast, works such as [23, 24, 43] propose label propagation, where geometric approaches such as video prediction [44] are used to generate labels; here, geometric knowledge is used to generate the pseudo-labels. To the best of our knowledge, our approach is the first to combine semantic and geometric understanding to propagate labels.

The integration of semantic and geometric understanding has previously been investigated for other tasks [45,46,47]. Our work differs in that it targets the accurate propagation of labels, which requires a loss function based on the cycle consistency of labels. Cycle consistency has previously been used to learn object embeddings [48, 49] and for video interpolation [50]. Our work is inspired by [48], where cycle consistency is used for learning object embeddings with a robust tracker. However, unlike [48], our approach addresses the noisy nature of the tracking/geometric modelling method itself.

Previous works deal with the noise in propagated labels by defining heuristics such as a trust factor [24] or label relaxation [22]. In contrast, our approach is principled and models the uncertainty associated with the labels. Recently, [28, 51,52,53] have explored the modelling of uncertainty in the context of deep learning models. Furthermore, works such as [54, 55] have explored the relation between aleatoric uncertainty and label noise. In [54], the authors propose a sophisticated noise model for dealing with noisy labelling in the task of keypoint matching. While we also utilize aleatoric uncertainty for dealing with label noise, our approach, unlike [54], is formulated to handle multiple noisy label distributions at the same time with minimal computational overhead.

3 Methodology

Section 3.1 defines our warping module, Sect. 3.2 describes the label refinement module, and Sect. 3.3 describes our noisy label learning approach.

Problem Formulation: Let \(D_{I}\) denote a set of images and \(D_{L}\) the corresponding labels. We also have a set of images \(D_{Iseq}\) consisting of images temporally adjacent to the images in \(D_{I}\). Our goal is to generate the corresponding set of pseudo-labels, \({\hat{D}}_{Lseq}\). With these pseudo-labels, we can train the network on \(D_{I}\) and \(D_{Iseq}\), using the corresponding labels from \(D_{L}\) and \({\hat{D}}_{Lseq}\).

3.1 Label Warping

The goal of this step is to generate warped labels \( \{{\hat{L}}^{w}_{n} | t\le n \le t + p\}\), where \(p \in {\mathbb {N}}\) is a fixed integer, given the sequence of images \(\{I_{n}| t\le n \le t + p\}\) and the annotated label \(L_{t}\). In this step, we use an existing method [22] for generating the warped labels.

The neural network \(f_{\theta }\) is trained such that, for the set of images \(I_{t:t+x} = \{I_{n}| t\le n \le t + x\}\), it predicts the parameters of the sampling and warping function \( \Phi \) that constructs \(I_{t+x}\) from \(I_{t + x -1}\). The pseudo-labels corresponding to \(I_{t+x}\) can similarly be constructed by applying the same function \(\Phi \) to \({\hat{L}}^{w}_{t+x-1}\). For simplicity, we rewrite \(I_{t:t+x}\) as the image pair \((I_{t+x-1},I_{t+x})\). This step can be summarized as:

$$\begin{aligned} \Phi _{(t+x-1, t+x)} = f_\theta (I_{t+x-1}, I_{t+x}), \end{aligned}$$
(1)
$$\begin{aligned} {\hat{I}}^{w}_{t+x} = \Phi _{(t+x-1, t+x)} ( {\hat{I}}^{w}_{t+x-1}) \quad ; \quad {\hat{L}}^{w}_{t+x} = \Phi _{(t+x-1, t+x)} ( {\hat{L}}^{w}_{t+x-1}), \end{aligned}$$
(2)

where we use \({\hat{L}}^{w}_t = L_t\) and \({\hat{I}}^{w}_t = I_t\). By applying this step sequentially for \(1 \le x \le p\), we can generate the warped labels \({\hat{L}}^{w}_{n}\) for all \(t \le n \le t + p\). This method exploits geometric information by predicting motion vectors for each pixel. A more detailed explanation can be found in [22, 44]. The method is prone to generating noisy labels due to errors in the estimation of the warping function \(\Phi \).
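The sequential application of Eq. (2) can be illustrated with a minimal sketch. It assumes that \(\Phi \) is given as a dense per-pixel motion field and that labels are warped with nearest-neighbour sampling; the learned sampling kernels of [22, 44] are not reproduced here, and the names `warp_nearest` and `propagate_labels` are hypothetical.

```python
import numpy as np

def warp_nearest(src, flow):
    """Backward-warp `src` (H, W) with a per-pixel motion field.

    Assumption: flow[y, x] = (dy, dx) means output pixel (y, x) is sampled
    from source location (y + dy, x + dx). Coordinates are rounded to the
    nearest pixel (class ids must not be interpolated) and clipped to the image.
    """
    h, w = src.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return src[src_y, src_x]

def propagate_labels(label_t, flows):
    """Apply Eq. (2) step by step, starting from the annotated label L_t.

    flows[x-1] plays the role of Phi_{(t+x-1, t+x)}; the returned list holds
    the warped labels for frames t, t+1, ..., t+p.
    """
    labels = [label_t]
    for flow in flows:
        labels.append(warp_nearest(labels[-1], flow))
    return labels

# Toy usage: every pixel moves one column to the right between frames.
label = np.tile(np.arange(4), (4, 1))
flow = np.zeros((4, 4, 2))
flow[..., 1] = -1.0  # output (y, x) reads from source (y, x - 1)
warped = propagate_labels(label, [flow, flow])
```

Because each step warps the output of the previous one, any error in an estimated motion field is carried into all later frames, which is the accumulation effect described above.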

Mask Inpainting: We first introduce a simple post-processing step (named \(\texttt{MI}\)) to enhance the warped labels. The goal is to identify regions where warping has failed and to replace the labels in such regions using another approximation strategy. We measure the distance between the warped image \({\hat{I}}\) and the real image I at each pixel (x, y); where it exceeds the threshold \(\tau \), we replace the warped labels with labels predicted by a semantic segmentation network \(S_\theta \) trained using \(D_I\) and \(D_L\). This is summarized as:

$$\begin{aligned} \begin{aligned} M(x, y)&= {\mathbb {I}}_{M}\big (d({\hat{I}}(x,y),I(x,y) ) \le \tau \big ) \\ {\hat{L}}^{MI}&= {\hat{L}}^{w} \odot M + (1-M) \odot {\hat{L}}^{pred} \end{aligned} \end{aligned}$$
(3)

where d is a measure of distance, \(\tau \in {\mathbb {R}}\) is a fixed threshold, \({\mathbb {I}}_{M}\) is the indicator function, \({\hat{L}}^{pred}\) represents the labels predicted by \(S_\theta \), and \(\odot \) is the element-wise multiplication operator. While the labels are significantly improved after post-processing with \(\texttt{MI}\), they still contain noise, which grows as we propagate further. Additionally, the inpainted labels can frequently be wrong for classes where the segmentation network \(S_\theta \) fails. For simplicity, we combine Eqs. (2) and (3), and denote warping followed by post-processing with \(\texttt{MI}\) as \(\Phi ^{MI}\).
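Equation (3) can be sketched in a few lines. The distance d is not specified in the text, so the mean absolute colour difference used below is an assumption, as is the function name `mask_inpaint`; for integer class maps, selecting with `np.where` is equivalent to the element-wise blend in Eq. (3).

```python
import numpy as np

def mask_inpaint(warped_label, pred_label, warped_image, image, tau):
    """Keep warped labels where the warped image matches the real image,
    otherwise fall back to the segmentation network's prediction."""
    # Assumed distance d: mean absolute difference over colour channels.
    dist = np.abs(warped_image.astype(float) - image.astype(float)).mean(axis=-1)
    mask = dist <= tau  # M(x, y) in Eq. (3)
    return np.where(mask, warped_label, pred_label), mask

# Toy usage: warping failed at one pixel, which is then inpainted.
image = np.zeros((2, 2, 3))
warped_image = image.copy()
warped_image[1, 1] = 100.0
warped_label = np.ones((2, 2), dtype=int)
pred_label = np.full((2, 2), 7)
label_mi, mask = mask_inpaint(warped_label, pred_label, warped_image, image, tau=10.0)
```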

Fig. 2 (a) For training \(\texttt {LRN}_\theta \), the cyclic propagated labels \(\hat{L}^{\circ }\) are used as input, as given by Eq. (5). (b) While generating labels, the warped post-processed labels \(\hat{L}^{MI}\) are given as input to \(\texttt {LRN}_\theta \). Iterative application of warping and refining is used to generate sequential labels. (c) The cyclic warped labels are generated for a wider range of cyclic loops to expose \(\texttt {LRN}_\theta \) to more warping error while training

3.2 Label Refinement

The warped post-processed labels \({\hat{L}}^{MI}\) need to be refined as they are still noisy. We do this by training a label refinement network (\(\texttt {LRN}_\theta \)), parameterized by \(\theta \), which takes the pseudo-labels, the warped image, and the real image as input and predicts the refined labels \({\hat{L}}^{R}\):

$$\begin{aligned} {\hat{L}}^{R} = \texttt {LRN}_{\theta }({\hat{L}}^{MI}, {\hat{I}}^{w}, I) \end{aligned}$$
(4)

\(\texttt {LRN}_\theta \) can be viewed as a denoising network, which takes the noisy samples \({\hat{L}}^{MI}\) and tries to predict the clean samples L. To train a denoising network, we typically need noisy-clean sample pairs. However, for \({t < n \le t + p}\) we do not have clean labels \(L_{n}\) (as we do not have \(D_{Lseq}\)). Consequently, we do not have the noisy-clean pairs \(({\hat{L}}^{MI}, L)\) for training our refinement module. Hence, while using \(\texttt{LRN}_\theta \) is fairly simple, training it is non-trivial.

Cycle Consistency of Labels: With an ideal propagation mechanism, when a label \(L_{t}\) is propagated through a cyclic loop in time (say t to \(t+1\) and then back to t), the resulting cyclic propagated labels (denoted by \({\hat{L}}^{\circ }_{t}\)) should be consistent with the initial labels \(L_{t}\). Therefore, the inconsistency between \({\hat{L}}^{\circ }_{t}\) and \(L_t\) reveals the failure modes of the propagation mechanism. We use this inconsistency as the supervisory signal to train \(\texttt{LRN}_\theta \). First, we define the cyclic propagated labels \({\hat{L}}^{\circ }\) for a simple cyclic loop in time:

$$\begin{aligned}&\Phi _{(t, t+1)} = f_\theta (I_t, I_{t+1}), \end{aligned}$$
(5)
$$\begin{aligned}&{\hat{I}}^{w}_{t+1} = \Phi _{(t, t+1)} (I_{t}) \quad ; \quad {\hat{L}}^{MI}_{t+1} = \Phi ^{MI}_{(t, t+1)} ( {\hat{L}}^{w}_{t}) \end{aligned}$$
(6)
$$\begin{aligned}&\Phi _{(t+1, t)} = f_\theta ({\hat{I}}^{w}_{t+1}, I_{t}), \end{aligned}$$
(7)
$$\begin{aligned}&{\hat{I}}^{\circ }_{t} = \Phi _{(t+1, t)} ({\hat{I}}^{w}_{t+1}) \quad ; \quad {\hat{L}}_{t}^{\circ } = \Phi ^{MI}_{(t+1, t)} ( {\hat{L}}^{MI}_{t+1}) \end{aligned}$$
(8)

These cyclic warped labels \({\hat{L}}_{t}^{\circ }\) contain artifacts created by the warping process, which are exacerbated by the repeated application of the warping function \(\Phi ^{MI}\). Motivated by the concept of cycle consistency of labels, we use the pairs \(( {\hat{L}}_{t}^{\circ }, L_{t})\) as the noisy-clean samples for training \(\texttt{LRN}_\theta \):

$$\begin{aligned} {\hat{L}}^{R} = \texttt {LRN}_\theta ({\hat{L}}^{\circ }, {\hat{I}}^{\circ }, I) \quad ; \quad \theta ^{*} = \underset{\theta }{\textrm{argmin}}\;{ {\mathbb {E}}({\mathcal {L}}({\hat{L}}^{R}, L))}, \end{aligned}$$
(9)

where \({\mathcal {L}}\) can be any standard loss function such as cross-entropy. In the process of learning to improve consistency between the cyclic warped labels \({\hat{L}}^{\circ }\) and L, \(\texttt {LRN}_\theta \) also learns to refine single warped labels \({\hat{L}}^{MI}\). It is important to note that cyclic labels capture the noise of the warping process precisely because \(\Phi _{(t+1, t)} \ne \Phi ^{-1}_{(t, t+1)}\).

Equation (8) represents the cyclic labels generated from a single forward and backward step. However, it is possible to perform multiple forward and backward steps, generating multiple \({\hat{L}}^{\circ }_{t}\) for each \(L_t\). This allows us to capture more diverse artifacts created by \(\Phi ^{MI}\). Figure 2 shows multiple cyclic warped samples for a given label \(L_t\).
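The construction of noisy-clean training pairs from cyclic loops of different lengths can be sketched as follows. This is only a toy illustration: the learned warp \(\Phi ^{MI}\) is replaced by a hypothetical column shift (`shift_cols`) that fills uncovered pixels with an assumed ignore label of 255, which is enough to show that a forward step followed by a backward step does not return the original label.

```python
import numpy as np

IGNORE = 255  # hypothetical "unknown" label for pixels with no source

def shift_cols(label, dx):
    """Toy stand-in for Phi^{MI}: shift a label map by dx columns."""
    out = np.full_like(label, IGNORE)
    w = label.shape[1]
    if dx >= 0:
        out[:, dx:] = label[:, :w - dx]
    else:
        out[:, :w + dx] = label[:, -dx:]
    return out

def cyclic_label(label_t, forward_steps, backward_steps):
    """Propagate forward through each step, then back again (Eq. 8 for one step)."""
    cur = label_t
    for step in forward_steps:
        cur = step(cur)
    for step in backward_steps:
        cur = step(cur)
    return cur

def make_training_pairs(label_t, max_loop):
    """One (noisy, clean) pair per loop length 1..max_loop."""
    fwd = lambda l: shift_cols(l, 1)
    bwd = lambda l: shift_cols(l, -1)
    return [(cyclic_label(label_t, [fwd] * k, [bwd] * k), label_t)
            for k in range(1, max_loop + 1)]

label = np.tile(np.array([1, 2, 3, 4]), (2, 1))
pairs = make_training_pairs(label, max_loop=2)
```

In this toy setting, longer loops corrupt a larger part of the label map, mirroring how longer cycles expose the refinement network to more severe warping artifacts.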

Once the refinement module is trained with the cyclic propagated label pairs, we propagate labels by a) warping, b) post-processing, and finally c) refining with \(\texttt{LRN}_\theta \) to generate the refined labels \({\hat{L}}^R_{t+p}\) [see Eq. (4)]. This is the complete pipeline of Warp-Refine Propagation. At each time step \(t+p\), the refined labels \({\hat{L}}^R_{t+p}\) are in turn propagated to generate \({\hat{L}}^R_{t+p+1}\).

3.3 Uncertainty Aware Training

As shown in Fig. 2, the Warp-Refine Propagation pipeline generates pseudo-labels of high quality. While these are directly beneficial for training, we note that using labels propagated over a large temporal distance (say \(t+10\)) can lead to a drop in performance. This is due to the inherent noise in the pseudo-labels. Handling this noise can be significantly advantageous, as pseudo-labels further away from the annotated frame contain novel information.

Formally, let us denote labels generated by a given noisy data-generation process \(\delta \) as samples from the distribution \(P_{\delta }(y|x)\), whereas the underlying label distribution is P(y|x). Typically, using a model \(M_{\theta }(y|x)\) parameterized by \(\theta \), we estimate the underlying label distribution P(y|x) by maximizing the log-likelihood of the model over the given data:

$$\begin{aligned} \begin{aligned} \theta ^{*} =\ {}&\underset{\theta }{\textrm{argmax}} \big [ {\mathbb {E}}_{y\sim P(y|x)} \log M_{\theta }(y|x) \big ] \\&\sim \underset{\theta }{\textrm{argmax}} \sum _{i \in D}{\log M_\theta (y_i|x_i)} \end{aligned} \end{aligned}$$
(10)

where \(\theta ^{*}\) represents the optimal parameters for \(M_\theta \). Since we only have noisy samples from \(P_{\delta }\), our model is biased towards modelling \(P_{\delta }\) rather than P. To address the distributional shift between \(P_\delta \) and P, we modify the objective of our optimization. Let us consider the relation between noisy labels and clean labels as \(P(y_\delta |x,y)\), where \(y_\delta \) represents the noisy sample for a given x. Since we want \(M_\theta \) to model the underlying label distribution P, we can write our estimate \(P_\theta \) for the noisy labels \(y_\delta \), and the corresponding objective, as:

$$\begin{aligned} P(y_\delta |x) = \sum _{y'}P(y_\delta | x, y')P(y=y'|x) \quad ; \quad P_\theta (y_\delta |x) = \sum _{y'}P(y_\delta | x, y')M_\theta (y=y'|x) \end{aligned}$$
(11)
$$\begin{aligned} \theta ^{*} \sim \underset{\theta }{\textrm{argmax}} \Big [ \sum _{y_\delta \sim P_\delta }{\log P_\theta (y_\delta | x)} \Big ] \end{aligned}$$
(12)
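Equations (11) and (12) can be made concrete with a small numerical sketch. It assumes, purely for illustration, that the noise model \(P(y_\delta | x, y')\) does not depend on x and can be stored as a K-by-K transition matrix; the function names are hypothetical.

```python
import numpy as np

def noisy_label_probs(clean_probs, transition):
    """Eq. (11): P_theta(y_delta | x) = sum_{y'} P(y_delta | x, y') M_theta(y' | x).

    clean_probs: (N, K), each row is M_theta(. | x) for one sample.
    transition:  (K, K), transition[y', k] = P(y_delta = k | y = y'),
                 assumed independent of x in this sketch.
    """
    return clean_probs @ transition

def noisy_log_likelihood(clean_probs, transition, noisy_labels):
    """Objective of Eq. (12): log-likelihood of the observed noisy labels."""
    probs = noisy_label_probs(clean_probs, transition)
    picked = probs[np.arange(len(noisy_labels)), noisy_labels]
    return float(np.log(picked).sum())

clean_probs = np.array([[1.0, 0.0], [0.5, 0.5]])
transition = np.array([[0.9, 0.1], [0.2, 0.8]])
noisy_probs = noisy_label_probs(clean_probs, transition)
objective = noisy_log_likelihood(clean_probs, transition, np.array([0, 1]))
```

The model output `clean_probs` is never compared to the noisy labels directly; it is first pushed through the noise model, so maximizing the objective fits \(P_\delta \) while \(M_\theta \) remains an estimate of the clean distribution.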

Theorem 1

Let \(\epsilon = 1-\min _{y'}P(y_\delta =y' | x, y')\). If \(\epsilon < 0.5\), then the following inequality holds for the distributions \(P(y_\delta |x)\) and \(P_\theta (y_\delta |x) \) defined in Eq. (11):

$$\begin{aligned} d_{TV}(P(y|x), M_\theta (y|x)) \le \frac{1}{1-2\epsilon } \left( \sqrt{2KL[P(y_\delta |x)|P_\theta (y_\delta |x)]} + \gamma \right) , \end{aligned}$$
(13)

where \(d_{TV}(p(y),q(y))\) is the total variation distance and KL[p(y)|q(y)] is the Kullback–Leibler (KL) divergence.

Therefore, our objective (12), which minimizes the KL divergence between \(P(y_\delta |x)\) and \(P_\theta (y_\delta |x)\), also lowers the bound on the total variation distance between P(y|x) and \(M_\theta (y|x)\) (Fig. 3). The proof is provided in the supplementary material.

Note that our formulation is independent of the labelling process \(\delta \), and hence can be used for multiple labelling processes \(\delta _j; j \in {\mathbb {N}}\). Now, we model \(P_\delta \) as a noisy version of P. Taking inspiration from [28], we represent \(P_\theta (y_{\delta _j} | x)\) as:

$$\begin{aligned} P_\theta (y_{\delta _j} = k | x)&= E_{a^{i}_{j,k} \sim {\mathcal {N}}(\mu ^i_k(x), \sigma _{\delta _j}^i(x))}[{{\,\textrm{Softmax}\,}}(a^{i}_{j,k})] \end{aligned}$$
(15)

where \({{\,\textrm{Softmax}\,}}(a_k) = \exp (a_k)/\sum _{k'}\exp (a_{k'})\) is the softmax function. We adapt the model parameter \(\theta \) to model the statistics \((\mu ^i(x), \sigma _{\delta _0}^i(x),\sigma _{\delta _1}^i(x),.. )\) for each pixel i in a given image x. The optimization objective can now be written as:

$$\begin{aligned} \theta ^{*} = \underset{\theta }{\textrm{argmax}} \Bigg [ \sum _{j}{\sum _{i \in D_{{\delta _j}}}{\log P_\theta (y_{\delta _j} | x)}} \Bigg ] \end{aligned}$$
(16)

Fig. 3 (a) We quantitatively demonstrate how our method outperforms previous state-of-the-art methods. (b) [22], which uses geometric information, is susceptible to poor warping, whereas generating pseudo-labels from semantic predictions (via network \(\texttt{S}_\theta \)) labels rare classes poorly. Our method combines geometric and semantic information, generating cleaner labels

Since \( P_\theta (y_{\delta _j} | x)\) has no analytical solution, we approximate the objective by Monte Carlo integration as described in [28]. As \({{\,\textrm{Softmax}\,}}(\mu ^i)\) models the underlying distribution P(y|x), at test time our prediction for a given image is simply \(\mu ^i\) for each pixel i in the image.
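A minimal sketch of the Monte Carlo approximation of Eqs. (15) and (16) is given below. It assumes that \(\sigma \) denotes a standard deviation, that pixels are flattened into N rows, and that each labelling process \(\delta _j\) comes with its own array of standard deviations; the sample count and all function names are illustrative choices, not values from the paper.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def mc_noisy_probs(mu, sigma, n_samples=64, rng=None):
    """Monte Carlo estimate of Eq. (15) for one labelling process delta_j.

    mu:    (N, K) logit means, shared by all labelling processes.
    sigma: (N, 1) or (N, K) standard deviations for process delta_j.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal((n_samples,) + mu.shape)
    logits = mu[None] + sigma[None] * noise  # samples a ~ N(mu, sigma)
    return softmax(logits).mean(axis=0)      # average of Softmax(a)

def uat_objective(mu, sigmas, labels_per_process):
    """Eq. (16): sum of log-likelihoods over labelling processes j.

    labels_per_process[j] = (pixel indices, noisy labels) for process j.
    """
    total = 0.0
    for sigma, (idx, labels) in zip(sigmas, labels_per_process):
        probs = mc_noisy_probs(mu[idx], sigma[idx])
        total += np.log(probs[np.arange(len(labels)), labels]).sum()
    return float(total)

mu = np.array([[4.0, 0.0, 0.0], [0.0, 1.0, 2.0]])
clean = mc_noisy_probs(mu, np.zeros((2, 1)))      # sigma = 0: plain softmax
noisy = mc_noisy_probs(mu, np.full((2, 1), 3.0))  # large sigma: flatter
value = uat_objective(
    mu,
    [np.zeros((2, 1)), np.ones((2, 1))],
    [(np.array([0]), np.array([0])), (np.array([1]), np.array([2]))],
)
```

With \(\sigma = 0\) the estimate reduces to the ordinary softmax, while a larger \(\sigma \) flattens the predicted distribution, so the model can explain unreliable labels by raising the uncertainty of the corresponding labelling process instead of distorting \(\mu \).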

In practice, this objective translates to modelling separate aleatoric uncertainty components for each of the noisy labelling processes \(\delta _j\). The aleatoric uncertainty is modelled by adding a small two-layer head to the given segmentation models to predict \(\sigma _{\delta _j}\).

Using Pseudo Labels: We also utilize pseudo-semantic labelling [18, 19] for data points where label propagation is not possible. For Cityscapes [14], we use predictions from a model \(S_\theta \) trained utilizing only the labels \(D_L\), to create dataset \(D_{{\hat{L}}ps}\) (specifically, labels for the train_extra subset of Cityscapes). As our training procedure is agnostic to the label generation process, we can use the different labels by simply modelling a separate uncertainty parameter \(\sigma _{\delta _j}\) for each of them.

Fig. 4 Precision vs. Recall plot showing uncertainty effectively captures noise in label generation

Fig. 5 Adding a single propagated frame, and relaxed label loss (RLL) [22] is ineffective

Table 1 Benefits of training with our pseudo-labels

4 Experiments

4.1 Implementation Details

To train our label refinement network (\(\texttt{LRN}_\theta \)), we use the dual-task loss [56].

To train the semantic segmentation networks, we adopt the training methodology of [22] as our baseline. However, we do not use the Relaxed Label Loss proposed in [22], for reasons explained in Sect. 4.3. We train all our models for 220 epochs.

Our network architecture is based on DeepLabV3+ [57]. For ablations, we use a ResNeXt [58] backbone. For submissions on the test set, we use a WideResNet38 [59] backbone instead; the backbone is changed in order to improve performance on the test set.

Dataset Details: The ApolloScape dataset [15] contains 143,906 annotated images. To keep our experiments manageable, we create a training subset of 40,100 images and a validation subset of 6113 images. Within the training subset, we establish two partitions by creating continuous sequences of 21 frames each: \(D_I\), with 2005 images, and \(D_{Iseq}\), with 40,100 images. Importantly, ApolloScape provides annotations for all frames, which gives us the ground-truth label set \(D_{Lseq}\). We evaluate models trained with both clean annotations (from the original dataset) and propagated annotations (generated by our method) on the validation subset, which is kept separate throughout training.

The Cityscapes dataset [14] contains 5000 annotated images. The training set \(D_I\) has 2975 images, with a further 500 images for validation and 1525 for testing. In addition, we use two image sets: \(D_{Iseq}\), which consists of sequential images without annotations, and \(D_{Ips}\), which comes with coarse annotations. We discard these coarse labels and instead use pseudo-labelling to create the labels \(D_{{\hat{L}}ps}\).

4.2 Evaluating Label Propagation

We quantitatively show that our method propagates labels with significantly less noise than comparable techniques. We use the clean annotations \(D_{Lseq}\) in ApolloScape to evaluate the different propagation methods. From the manually annotated labels in \(D_L\), we generate the approximate labels \(D_{{\hat{L}}seq}\) with each propagation technique and evaluate them against the given annotated labels \(D_{Lseq}\).

Figure 5 shows the mean Intersection over Union (mIoU) of different propagation methods at each propagation length. This is evaluated on the sequences adjacent to the training set itself, as label propagation is conducted on those sequences only. We compare against the comparable method [22] and against predictions from a segmentation model \(\texttt{S}_\theta \) trained on \(D_L\) only. Our label propagation method surpasses the other methods and, as shown, [22] quickly starts performing even worse than \(\texttt{S}_\theta \).
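For reference, mIoU between propagated and ground-truth label maps can be computed from a confusion matrix, as in the sketch below. The ignore index of 255 and the choice to average only over classes that occur in either map are assumptions of this sketch, not details taken from the evaluation protocol above.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection over Union between two integer label maps.

    Pixels whose target equals `ignore_index` are skipped; `pred` is
    assumed to contain only class ids in [0, num_classes) on the rest.
    """
    valid = target != ignore_index
    idx = target[valid].astype(int) * num_classes + pred[valid].astype(int)
    conf = np.bincount(idx, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)  # rows: target, cols: pred
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0  # average only over classes that appear
    return float((inter[present] / union[present]).mean())

target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)  # class IoUs are 1/2 and 2/3
```

Evaluating this score separately for the labels at each offset from the annotated frame gives one value per propagation length.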

4.3 Learning with Generated Labels

We evaluate the improvement in semantic segmentation obtained by training with the generated labels, and report the mean of each metric over three runs.

First, we present results countering the claims of the previous work [22]. Specifically, we find [22] to be ineffective in improving semantic segmentation. The baseline reported in [22] is trained for one-third of the iterations with a suboptimal learning rate (Table 2). By equalizing the number of training iterations and increasing the learning rate, we find that the baseline is able to match the proposed models from [22].

Table 2 State-of-the-art methods on the Cityscapes dataset (test partition)
Table 3 We show the benefit of training with our pseudo-labels

To benefit from propagated data, we find it essential to include propagated samples from multiple timesteps. Following [24], for each label in \(D_L\), we include propagated labels at timesteps \(t \pm p\) where \(p \in \{2,4,6,8\}\) from \(D_{{\hat{L}}seq}\). Furthermore, we include the pseudo-labels \(D_{{\hat{L}}ps}\) on the train_extra subset as well. Table 3 shows the results. The propagated labels, when used with Uncertainty-Aware Training (UAT), boost performance by 0.49 mIoU. Performance is further boosted by 0.47 mIoU when \(D_{{\hat{L}}ps}\) is used as well. Note that UAT is critical for increasing performance in the presence of propagated labels.

Finally, using the same method, we improve upon the results of state-of-the-art methods on Cityscapes (when training with fine labels only). Table 3 shows the improvement obtained by training with our method. We observe that in the presence of coarse labels and Mapillary Vistas pretraining [66], the benefits of label propagation are not clear. This is expected, as label propagation cannot be performed for the coarse labels or the Mapillary Vistas dataset; hence, in the presence of those labels, propagation is performed for only \(\sim \) 10% of the entire dataset.

Evaluation on ApolloScape: In Table 1 we show the benefit of training with propagated labels and the uncertainty-aware training regime on the ApolloScape dataset.
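The selection of propagated frames at \(t \pm p\) can be written as a small helper. The function name, the zero-based indexing, and the clipping to the sequence bounds are assumptions of this sketch; the annotated-frame index used in the example is only illustrative.

```python
def propagated_frame_indices(t, seq_len, offsets=(2, 4, 6, 8)):
    """Frame indices t - p and t + p that fall inside a sequence of
    `seq_len` frames, for every offset p."""
    frames = []
    for p in offsets:
        for n in (t - p, t + p):
            if 0 <= n < seq_len:
                frames.append(n)
    return sorted(frames)

# A 30-frame sequence with the annotated frame at index 19 (illustrative).
selected = propagated_frame_indices(19, 30)
# An annotated frame near the start keeps only the forward offsets.
near_start = propagated_frame_indices(1, 30)
```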

Modelling Label Noise with Uncertainty: We demonstrate that the uncertainty estimates are able to model the noise in the data-generation process.

In summary, we evaluate the impact of training with generated labels on semantic segmentation performance. Contrary to the previous study [22], we find that the baseline reported there underperforms because of its training settings; once the number of training iterations is matched and the learning rate is increased, the baseline is on par with their proposed models. Using propagated labels from multiple timesteps together with Uncertainty-Aware Training (UAT) yields a 0.49 mIoU improvement, and adding the pseudo-labels \(D_{{\hat{L}}ps}\) yields a further 0.47 mIoU. UAT is critical for improving performance in the presence of propagated labels. The approach also improves upon state-of-the-art results on Cityscapes and yields gains on ApolloScape, as detailed in Tables 1 and 3. Finally, the precision-recall curves in Fig. 4 show that the uncertainty estimates effectively capture the noise in label generation.
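One way to obtain a precision-recall curve of this kind is to treat the predicted uncertainty as a detector of label errors. The sketch below assumes access to a per-pixel uncertainty score and a ground-truth indicator of whether the generated label is wrong; it is an illustration of the idea and not necessarily the exact protocol behind Fig. 4.

```python
import numpy as np

def precision_recall(uncertainty, is_error):
    """Precision and recall of flagging the k most uncertain pixels as
    label errors, for k = 1..N."""
    order = np.argsort(-uncertainty, kind="stable")  # most uncertain first
    hits = np.cumsum(is_error[order].astype(float))  # errors found so far
    k = np.arange(1, len(order) + 1)
    precision = hits / k
    recall = hits / max(hits[-1], 1.0)               # guard against no errors
    return precision, recall

uncertainty = np.array([0.9, 0.1, 0.8, 0.2])
is_error = np.array([True, False, False, True])
precision, recall = precision_recall(uncertainty, is_error)
```

If uncertainty tracked label noise perfectly, all erroneous pixels would be ranked first and precision would stay at 1 until recall reaches 1.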

5 Conclusion

Our work addresses two questions: how to propagate labels, and how to train with noisy pseudo-labels. Our propagation method uses the concept of cycle consistency of labels to significantly improve label propagation. Further, our noisy label learning approach makes effective use of uncertainty and mitigates the drawbacks of training with noisy labels. Our methods achieve improved results on Cityscapes. We believe that our noisy label learning approach opens the door to using noisier augmentation processes, such as image-based rendering methods, and we hope to explore this direction in future work.

6 Potential Broader Impact

This research can benefit companies or institutions that require semantic understanding of video data for applications such as autonomous driving. If autonomous driving becomes a ubiquitous reality, people currently employed as drivers could be disadvantaged by the results of this research, although indirectly. If an autonomous driving system fails due to the proposed component, the consequence could be an accident or the loss of human life. We believe our proposed method leverages some bias in the data, as the distribution of the training set will influence the representations learned by a downstream semantic image segmentation network.