1 Introduction

Advances in visual scene understanding using deep learning, driven by convolutional neural network architectures and large annotated image collections (Chen et al. 2016, 2018; Xie et al. 2016; Luo et al. 2015), have achieved excellent performance in per-pixel labelling of semantic categories in images of real-world scenes.

These advances in semantic segmentation have been exploited to improve scene flow estimation between pairs of frames for dynamic scenes (Behl et al. 2017). However, semantic segmentation from a single view suffers from inherent visual ambiguity, which leads to errors in flow estimation at object boundaries and in regions of uniform appearance. Errors may also be introduced in scene flow estimated between pairs of frames due to large non-rigid motions and self-occlusions in dynamic sequences. With multiple views, independent classification of different views and different time instants of the same scene may result in inconsistent per-pixel flow and semantic labelling for the same object.

This paper introduces a framework for semantically coherent long-term 4D scene flow (aligning an entire dynamic sequence of \(>150\) frames), co-segmentation and reconstruction of dynamic scenes, as shown in Fig. 1 for the publicly available Juggler dataset (Ballan et al. 2010) captured with 6 hand-held unsynchronised moving cameras. Joint semantic co-segmentation (top row), flow estimation, and 4D reconstruction (bottom row) result in a significant improvement in per-view 2D segmentation, 4D scene flow and reconstruction. The approach enforces semantic coherence both spatially, across different views of the scene, and temporally, across different observations of the same object, for robust long-term 4D flow estimation. Semantic tracklets are introduced to identify similar frames in time across a sequence, exploiting semantic, motion, shape and appearance information between different observations of a dynamic object over time. This gives improved temporal coherence, enabling long-term flow estimation along with consistent semantic co-segmentation of long sequences across multiple views. Joint semantic scene flow, co-segmentation, and reconstruction enforces spatio-temporal semantic coherence in flow estimation, resulting in improved performance over previous approaches which did not exploit semantic and depth information in space and time.

Fig. 1

Example of input image from the Juggler dataset (Ballan et al. 2010) and the proposed framework resulting in accurately labeled segmentation, 4D reconstruction and scene flow (represented by color mask propagation on the dynamic object of the scene) (Color figure online)

Previous research has demonstrated the advantages of joint semantic segmentation and flow estimation (Tokmakov et al. 2019; Behl et al. 2017; Sevilla-Lara et al. 2016a; Zhu et al. 2018), joint segmentation and reconstruction across multiple views (Yang et al. 2018; Hane et al. 2013, 2016; Engelmann et al. 2016; Kundu et al. 2014), co-segmentation of multiple view images (Khoreva et al. 2019; Chiu and Fritz 2013; Kolev et al. 2012; Djelouah et al. 2015, 2016) and temporal coherence in reconstruction (Li et al. 2018; Goldluecke and Magnor 2004; Floros and Leibe 2012; Larsen et al. 2007; Mustafa et al. 2016a). Our contribution is the introduction of a framework for joint semantically coherent 4D scene flow with co-segmentation and reconstruction of complex dynamic scenes to obtain semantically coherent per-view long-term scene flow, 2D object segmentation and 4D scene reconstruction from wide-baseline camera views. Our approach to long-term scene flow, co-segmentation and 4D dynamic shape reconstruction leverages recent advances in single-view semantic segmentation and semantic flow estimation.

The input to the framework is a set of multi-view videos. Per-view initial semantic segmentation is obtained using Mask RCNN (He et al. 2017) and FCN (Chen et al. 2018); in principle, any semantic video segmentation approach could be used. The initial semantic segmentation is combined with sparse reconstruction to obtain an initial semantic reconstruction. A joint semantic flow, co-segmentation and reconstruction optimization is proposed to refine this initial segmentation and reconstruction. Semantic coherence is enforced using semantic tracklets, which link frames to enforce temporal coherence between widely spaced timeframes. Semantic coherence refers to spatial and temporal coherence of semantic labels across the sequence. The per-view semantic flow and reconstruction are combined across views for the entire dynamic sequence to obtain semantically coherent long-term dense 4D scene flow, co-segmentation and reconstruction.

The primary contribution is semantically coherent scene flow, semantic co-segmentation and 4D reconstruction across multiple views. An initial version of this work was published in CVPR (Mustafa and Hilton 2017), where we proposed a method for semantic segmentation and reconstruction of dynamic scenes. The contributions of this paper over our previous work are as follows: (a) semantically coherent long-term 4D scene flow estimation for dynamic scenes in addition to semantic segmentation and reconstruction; (b) refined methodology enabling joint semantically coherent scene flow, co-segmentation and reconstruction by adding motion optimization to the energy defined in Eq. 6, with the resulting 2D flow projected onto the 3D reconstruction to obtain the final 4D scene flow; (c) refined methodology to estimate semantic tracklets by adding a motion constraint in Eq. 1; and (d) comprehensive performance evaluation of flow, segmentation, and reconstruction on challenging datasets. To the best of our knowledge, this is the first method addressing the problem of semantically and temporally coherent long-term 4D scene flow, semantic co-segmentation and reconstruction for dynamic scenes. The contributions of the paper include:

  • A method to estimate scene flow, 4D mesh and 2D semantic video segmentation for natural dynamic scenes from multi-view videos.

  • Joint semantic scene flow, co-segmentation, and reconstruction of dynamic objects in complex scenes exploiting spatial and temporal coherence.

  • Semantic tracklets for long-term 4D reconstruction by enforcing spatial and temporal coherence in semantic labelling for improved scene flow of video across wide-timeframes.

  • Improved flow, segmentation, and reconstruction of dynamic scenes from multiple moving cameras.

2 Related Work

2.1 Semantic Segmentation

Various methods have been proposed in the literature for semantic segmentation of images. In the first category, the image is initially segmented, followed by per-segment object category classification (Mostajabi et al. 2015; Gupta et al. 2014); however, errors in segmentation propagate to the semantic labelling. Several papers address this issue by computing deep per-pixel CNN features followed by classification of each pixel in the image (Farabet et al. 2013; Hariharan et al. 2015), but per-pixel prediction leads to segmentations with fuzzy boundaries and spatially disjoint regions. Another group of methods, pioneered by Long et al. (2015) and He et al. (2017), predicts segmentations directly from the raw pixels. Methods were introduced to improve the spatial coherence of the semantic segmentation using conditional random fields (CRF) (Kundu et al. 2016; Zheng et al. 2015; Chen et al. 2014). End-to-end methods were subsequently proposed to overcome the limitations of CRF-based approaches (Chen et al. 2018; Zhang et al. 2018), improving performance significantly.

Co-segmentation: This was first introduced by Rother et al. (2006) for simultaneous binary segmentation of object parts in an image pair and extended to simultaneous segmentation of multiple images (Batra et al. 2010). Multi-view co-segmentation in space and time was introduced in Djelouah et al. (2016), where a common foreground is obtained from multiple views using appearance and motion cues. Semantic co-segmentation methods from a single video use spatio-temporal object proposals (Joulin et al. 2012; Luo et al. 2015), segments (Kolev et al. 2012), motion (Rother et al. 2006) and foreground propagation (Goldluecke and Magnor 2004). Recently, co-segmentation methods were introduced to segment common objects in a collection of videos for a single object (Maninis et al. 2018; Fu et al. 2014) or multiple objects (Tokmakov et al. 2019; Chiu and Fritz 2013; Zhong and Yang 2016). A CNN method for both single and multiple object segmentation was introduced in Khoreva et al. (2019), exploiting a training strategy that requires less annotated data.

2.2 Semantic Flow Estimation

Methods have been proposed to exploit semantic information to improve monocular flow or per-frame motion estimation (Li et al. 2018; Behl et al. 2017; Sevilla-Lara et al. 2016a; Zhu et al. 2018; Tsai et al. 2016). Semantic 2D detections were exploited to improve tracking for autonomous driving in Li et al. (2018). The benefits of segmentations, bounding boxes and object coordinates for flow estimation were reviewed in Behl et al. (2017) for dynamic road scenes. Sevilla-Lara et al. (2016a) exploit advances in static semantic segmentation to segment the image into objects of different types, followed by modelling the motion of each object depending on its type. However, for non-rigid dynamic objects such as people, defining a single motion model for the entire object is not effective. A method to exploit flow information for video segmentation was proposed in Tsai et al. (2016), reporting improved video segmentation. However, all of these methods work only for street scenes or static scenes and do not exploit any stereo or multiple view information.

2.3 Joint Estimation

General multi-view image segmentation methods use appearance and contrast information, which may not be sufficient for complex real-world scenes. To improve the results, joint optimisation of segmentation with 3D reconstruction has been proposed (Mustafa et al. 2016a) by including multiple view photo-consistency. This concept was extended to semantic segmentation and reconstruction to obtain additional information from the scene (Jiao et al. 2018; Hane et al. 2016; Xie et al. 2016). Methods were introduced to utilize appearance-based pixel categories and stereo cues in a joint framework for street scenes from a monocular camera (Vineet et al. 2015; Floros and Leibe 2012); these methods used CRFs to perform simultaneous dense reconstruction and segmentation of street scenes captured from a moving camera. A method to estimate the pose and shape of people was proposed in Zanfir et al. (2018), and another to estimate the pose and 3D shape of rigid objects in street scenes was proposed in Engelmann et al. (2016). An unsupervised method to jointly learn depth and flow using cross-task consistency was proposed for monocular video (Zou et al. 2018). Another method jointly estimates dense depth, optical flow and camera pose (Yin and Shi 2018). Recently a method was proposed for joint unsupervised learning of depth, camera motion, optical flow and motion segmentation (Ranjan et al. 2018). However, these methods cannot be directly applied to multi-view wide-baseline scenes. A method for joint estimation of 3D geometry and pose was proposed for rigid objects (Tulsiani et al. 2018). Dense semantic reconstruction of rigid objects was proposed by Bao et al. (2013). Joint semantic segmentation and reconstruction using multiple images was proposed for static scenes (Hane et al. 2013). However, these methods are limited to static scenes and rigid objects.

Fig. 2

Semantically coherent co-segmentation, reconstruction and flow estimation framework

Joint motion and reconstruction or segmentation methods (Roussos et al. 2012; Sevilla-Lara et al. 2016b) were proposed for dynamic scenes. Techniques have been introduced to align dense meshes using correspondence information between consecutive frames (Zanfir and Sminchisescu 2015; Mustafa et al. 2016b) or to extract scene flow by estimating the pairwise surface or volume correspondence between reconstructions at successive frames (Wedel et al. 2011; Basha et al. 2010). State-of-the-art joint estimation methods give per-frame reconstruction and semantic segmentation of the scene (Chen et al. 2019; Kendall et al. 2017) exploiting a multi-task learning framework. However, these methods do not align meshes for the entire sequence, do not give semantically coherent segmentation, and do not work for wide-baseline scenes. Our previous work (Mustafa and Hilton 2017) gives per-frame semantic segmentation and reconstruction of dynamic scenes, leading to unaligned meshes for the dynamic sequence. The proposed method estimates 4D scene flow along with reconstruction and semantic co-segmentation, aligning meshes over the entire dynamic sequence to give long-term semantic 4D scene flow.

Fig. 3

The improvement of semantic segmentation using the proposed framework for the Odzemok and Magician datasets

This paper introduces joint semantic flow, co-segmentation, and reconstruction enforcing coherence in both the spatial and temporal domains for scenes, with rigid and non-rigid dynamic objects, captured with multiple wide-baseline moving cameras. A key contribution of our work is that we combine semantics, shape, motion and appearance information in space and time in a single optimization to generate results automatically. The per-view motion, depth and semantic segmentation are combined across views and time for the entire dynamic sequence to obtain 4D semantic flow. Evaluation demonstrates improved accuracy and completeness of flow, segmentation and reconstruction for complex dynamic scenes.

3 Semantic 4D Scene Flow and Segmentation

Overview:

This section gives an overview of the proposed framework for semantic temporal coherence, illustrated in Fig. 2. It comprises the following stages:

  • Input: Multi-view videos are input to the system.

  • Initial Semantic Segmentation—Sect. 3.1:

    Initial semantic labels are estimated for each pixel in the image per-view using state-of-the-art semantic segmentation (He et al. 2017; Chen et al. 2018).

  • Initial Semantic Reconstruction—Sect. 3.1:

    Semantic information for each view is combined with sparse 3D feature correspondence between views to obtain an initial semantic 3D reconstruction. This initial reconstruction combines semantic information across views but results in inconsistency due to inaccuracies in the initial per-view segmentation.

  • Semantic Tracklets—Sect. 3.2:

    To enforce long-term semantic coherence temporally we propose semantic tracklets that identify a set of similar frames for each dynamic object. Similarity between any pair of frames is estimated from the per-view semantic labels, appearance, shape and motion information.

    Semantic tracklets provide a prior for the joint space-time semantic co-segmentation and reconstruction to enforce temporal coherence.

  • Joint Semantic Flow, Co-segmentation and Reconstruction—Sect. 3.3: The initial semantic segmentation and reconstruction is refined per-view for each dynamic object through joint optimisation of flow, segmentation, and shape across multiple views and over time using the semantic tracklets. Per-view information is merged into a single 3D model using Poisson surface reconstruction (Kazhdan et al. 2006).

  • Semantic 4D Scene Flow and Segmentation—Sect. 3.3: The process is repeated for the entire sequence and is combined across views and in time to obtain semantically coherent long-term dense 4D scene flow, co-segmentation, and reconstruction for the complete scene.

The following sections include a detailed explanation of the proposed approach and highlight the novel contributions.

3.1 Initial Segmentation and Reconstruction

Initial Semantic Segmentation: Mask RCNN is used for initial semantic segmentation because it is a state-of-the-art object detector that computes per-instance masks and class labels. Mask RCNN adopts a two-stage procedure to predict the semantic segmentation of images. The object masks from Mask RCNN (He et al. 2017) are combined with background segmentation (Chen et al. 2018) to obtain a dense semantic segmentation mask. For each frame in the sequence we perform deep semantic segmentation, which estimates the probabilities of the various classes at each pixel in the image. The network is trained on the MS-COCO (Lin et al. 2014) dataset with 81 classes and refined on the PASCAL VOC12 (Everingham et al. 2012) dataset. Despite being state of the art, the output masks still do not align accurately with the object boundaries, as illustrated in Fig. 3b.

Initial Semantic Reconstruction: Sparse feature-based reconstruction of the scene is performed using SFD features (Mustafa et al. 2019) and the SIFT descriptor (Lowe 2004), with the constraint that each 3D feature should be visible in 3 or more camera views for robustness (Hartley and Zisserman 2003). The resulting point-cloud is clustered in 3D (Rusu 2009). Clusters are formed between points with the same class labels across multiple views such that each cluster represents a semantically consistent object. Insufficient 3D features may occur on parts of an object due to lack of texture or visual ambiguity. To avoid incomplete reconstruction, the sparse 3D object clusters are combined with the initial semantic segmentation to obtain the initial semantic reconstruction. A mesh is obtained for the sparse 3D point clusters by triangulation to give an initial coarse reconstruction of each object. The initial coarse reconstruction is back-projected in each view onto the initial semantic segmentation. If the back-projected mask is smaller than its respective semantic region in 2 or more views, the initial coarse reconstruction is dilated in volume (3D) by v to enclose the object and match the segmentation boundaries in each view: \(v = \frac{1}{N_h} \sum _{c=1}^{N_h}\frac{B_{s}^{c} - B_{r}^{c}}{B_{s}^{c}}\), where \(N_h\) is the number of views with a smaller back-projected mask, \(B_{s}^{c}\) is the area of the semantic segmentation and \(B_{r}^{c}\) is the area of the back-projected mask of the initial coarse reconstruction in view c. This automatically initializes the reconstruction of each object in the scene without any strong initial prior.
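As a concrete illustration, the per-object dilation factor could be computed from per-view binary masks as in the minimal sketch below; the function name and mask representation are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def dilation_factor(semantic_masks, backprojected_masks):
    """Estimate the volume dilation factor v from per-view binary masks.

    semantic_masks / backprojected_masks: lists of HxW boolean arrays, one
    per camera view, for a single object. Only views where the back-projected
    mask is smaller than the semantic region contribute (these form N_h).
    """
    ratios = []
    for sem, rec in zip(semantic_masks, backprojected_masks):
        b_s = float(np.count_nonzero(sem))   # area of semantic segmentation B_s^c
        b_r = float(np.count_nonzero(rec))   # area of back-projected mask B_r^c
        if b_s > 0 and b_r < b_s:            # this view counts towards N_h
            ratios.append((b_s - b_r) / b_s)
    if len(ratios) < 2:                      # dilation only applied if 2 or more views disagree
        return 0.0
    return float(np.mean(ratios))            # v = (1/N_h) * sum_c (B_s^c - B_r^c) / B_s^c
```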

Fig. 4

Example of dynamic tracklet generation (similar frames) for a dynamic object at current frame 53 based on appearance, shape and semantic information. The spatial and temporal neighbourhoods used in the optimization are shown at the top in green and yellow respectively (Color figure online)

3.2 Semantic Tracklets

In the case of general dynamic scenes with non-rigid objects, independent per-frame scene flow estimation, segmentation and reconstruction leads to incoherent results, for example failure to predict flow and reconstruct thin structures such as limbs, and poorly localized object boundaries. Sequential methods for frame-to-frame temporal coherence are prone to errors due to drift and rapid motion (Beeler et al. 2011; Prada et al. 2016). Previous work (Zhong and Yang 2016) introduced semantic tracklets for object segmentation in single-view video based on co-segmentation across video collections. In this paper we achieve long-term scene flow, semantic co-segmentation and robust temporally coherent 4D reconstruction by introducing semantic tracklets which link instances of dynamic objects across wide timeframes. These provide a prior to constrain long-term flow, co-segmentation and reconstruction. In our work semantic tracklets are defined for multiple views of the same dynamic scene to ensure temporal and spatial coherence in semantic 4D flow and 2D labelling, whereas in Zhong and Yang (2016) tracklets segment objects in a single video and relate them to similar object instances in multiple videos.

Semantic tracklets for a dynamic object are defined as a set of frames which have similar motion across 3 or more views, semantic labels, appearance and 2D shape as illustrated in Fig. 4. Tracklets are used for long-term learning of flow, semantic labels, appearance and shape information for per-view joint semantic 4D scene flow, co-segmentation and reconstruction of each object. This improves the temporal and semantic coherence in flow, reconstruction and segmentation results as shown in Fig. 5.

Fig. 5

Comparison of segmentation using the proposed multi-view optimization against optimization without semantic information and without tracklet information, respectively, for the Handshake and Odzemok datasets

Fig. 6

Similarity matrices for each component of semantic tracklet estimation, along with the combined similarity matrix

Dynamic objects are identified in the scene using motion information from sparse temporal SFD feature correspondences with SIFT descriptors. The semantic, 2D shape, motion and appearance similarity of a dynamic object is evaluated for each frame against all previous frames to identify the set of similar frames which form a tracklet. The similarity metric is defined as follows:

$$\begin{aligned} S_{i,j} = \frac{1}{4 N_{v}}\sum _{c=1}^{N_{v}} ( C_{i,j}^{c} + M_{i,j}^{c} + J_{i,j}^{c} + L_{i,j}^{c} ) \end{aligned}$$
(1)

where C() is the measure of appearance similarity, M() is the measure of motion similarity, J() is the measure of shape similarity, L() is the measure of semantic similarity, and \(N_{v}\) is the number of views at each frame. These similarities are combined across time and views, and all frames with similarity \(>0.75\) are selected as the \(N_S\) similar frames forming a semantic tracklet \(T_{i}\) for each dynamic object at the \(i^{th}\) frame, \(T_{i} = \left\{ t_r \right\} _{r=1}^{N_S}\), where \(t_r \in [0,i-1]\). An example of the frame-to-frame similarities is illustrated in Fig. 6 for the Juggler sequence, depicting the individual measures and the overall similarity matrix.
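A minimal sketch of tracklet selection (Eq. 1), assuming the four per-view similarity matrices have already been computed; array shapes and names are illustrative, not the paper's implementation.

```python
import numpy as np

def semantic_tracklet(C, M, J, L, frame_i, threshold=0.75):
    """Form the semantic tracklet T_i for a dynamic object at frame i (Eq. 1).

    C, M, J, L: arrays of shape (N_v, N_frames, N_frames) holding the per-view
    appearance, motion, shape and semantic similarities between frame pairs.
    Returns the indices t_r of earlier frames whose combined similarity S_{i,j}
    exceeds the threshold.
    """
    n_views = C.shape[0]
    # S_{i,j} = 1 / (4 N_v) * sum over views of (C + M + J + L)
    S_i = (C[:, frame_i] + M[:, frame_i] + J[:, frame_i] + L[:, frame_i]).sum(axis=0)
    S_i /= 4.0 * n_views
    earlier = np.arange(frame_i)             # candidate frames t_r in [0, i-1]
    return earlier[S_i[:frame_i] > threshold]
```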

Semantic Similarity: The semantic region associated with the object at each frame is identified using sparse wide-timeframe SFD feature matches combined with the SIFT descriptor. An affine warp (Evangelidis and Psarakis 2008) based on the feature correspondences and region boundary is employed to transfer the semantic region segmentation to the current frame. The semantic similarity metric \(L_{i,j}^{c}\) is defined as the ratio of the number of pixels with the same class label \(z_{i,j}^{c}\) to the total number of pixels in the segmented region \(y_{i,j}^{c}\) at frames i and j for view c:

$$\begin{aligned} L_{i,j}^{c} = \frac{z_{i,j}^{c}}{y_{i,j}^{c}} \end{aligned}$$
(2)

Appearance Similarity: The appearance metric \(C_{i,j}^{c}\) between frames i and j for the semantic region segmentation in view c corresponding to a dynamic object is the ratio of the number of temporal feature correspondences which are consistent across three or more views \(q_{i,j}^{c}\) to the total number of feature correspondences in the segmented region \(u_{i,j}^{c}\) (Mustafa et al. 2016b):

$$\begin{aligned} C_{i,j}^{c} = \frac{q_{i,j}^{c}}{u_{i,j}^{c}} \end{aligned}$$
(3)

Motion Similarity: The motion metric \(M_{i,j}^{c}\) between frames i and j for the semantic region segmentation in view c corresponding to a dynamic object is the ratio of the average motion of the object across three or more views \(s_{i,j}^{c}\) to the maximum motion between frames over the entire sequence, max (Mustafa et al. 2016b):

$$\begin{aligned} M_{i,j}^{c} = \frac{s_{i,j}^{c}}{max} \end{aligned}$$
(4)

Shape Similarity: This metric gives a measure of the 2D region shape similarity between pairs of frames for each dynamic object. Semantic region segmentations are aligned using an affine warp (Evangelidis and Psarakis 2008). The metric is defined as the ratio of the area of intersection of the aligned segmentations \(h_{i,j}^{c}\) to the area of their union \(a_{i,j}^{c}\):

$$\begin{aligned} J_{i,j}^{c} = \frac{h_{i,j}^{c}}{a_{i,j}^{c}} \end{aligned}$$
(5)
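The shape and semantic similarities (Eqs. 5 and 2) reduce to simple ratios once the affine alignment has been applied; a rough sketch, assuming boolean masks and label maps already warped into the current frame, with illustrative names.

```python
import numpy as np

def shape_similarity(mask_i, aligned_mask_j):
    """Shape similarity J_{i,j}^c (Eq. 5): intersection-over-union of the region
    segmentations, where aligned_mask_j has already been warped into frame i
    using the affine transform estimated from feature correspondences."""
    inter = np.logical_and(mask_i, aligned_mask_j).sum()   # h_{i,j}^c
    union = np.logical_or(mask_i, aligned_mask_j).sum()    # a_{i,j}^c
    return inter / union if union > 0 else 0.0

def semantic_similarity(labels_i, warped_labels_j, region_mask):
    """Semantic similarity L_{i,j}^c (Eq. 2): fraction of pixels in the segmented
    region whose class labels agree after warping frame j into frame i."""
    same = (labels_i == warped_labels_j) & region_mask     # z_{i,j}^c
    total = region_mask.sum()                              # y_{i,j}^c
    return same.sum() / total if total > 0 else 0.0
```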

Importance of Semantic Tracklets: Semantic tracklets provide both temporal and multi-view priors for semantic 4D long-term flow estimation and co-segmentation, and are central to obtaining improved scene flow, segmentation and 4D reconstruction. A comparison of the multi-view optimization with and without semantic label and temporal tracklet information is presented in Fig. 5: the proposed approach consistently performs better, giving more accurate flow and segmentation. Semantic tracklets also result in a significant improvement in scene flow estimation, reconstruction and multi-view video segmentation in comparison to state-of-the-art methods, as demonstrated in Sect. 4. The final multiple view 4D flow, co-segmentation and reconstruction method using both semantic labels and tracklets gives significantly improved and more robust 4D flow and segmentation.

3.3 Joint Semantic Scene Flow, Co-segmentation and Reconstruction

The goal of multi-view joint semantic flow estimation, co-segmentation and reconstruction is to refine the initial semantic reconstruction obtained in Sect. 3.1 for each dynamic object over the region \({\mathscr {R}}\) per-view by optimizing the following variables: (a) a translation in time \(m_p = (\delta x_p, \delta y_p)\) for each pixel location \({p} = (x_p, y_p)\) in image I, from a predefined set of flow vectors \({\mathscr {M}}\); (b) a semantic label for each pixel p in the initial semantic segmentation region \({\mathscr {S}}\) of each object, from the set of semantic classes obtained at initialization (Sect. 3.1), \({\mathscr {L}} = \left\{ l_{1},\ldots ,l_{\left| {\mathscr {L}} \right| } \right\} \), where \(\left| {\mathscr {L}} \right| \) is the total number of classes in the network; and (c) a depth value for each pixel p, jointly assigned from a set of depth values \({\mathscr {D}} = \left\{ d_{1},\ldots ,d_{\left| {\mathscr {D}} \right| -1}, {\mathscr {U}} \right\} \), where \(d_{i}\) is obtained by sampling the optical ray from the camera and \({\mathscr {U}}\) is an unknown depth value to handle occlusions.

Long-term 4D flow and co-segmentation is achieved by propagating the semantic labels across views and over time using tracklets in the framework. Formulation of a cost function for semantically coherent depth and motion estimation and co-segmentation is based on the following principles:

  • Local spatio-temporal coherence: Spatially and temporally neighbouring pixels are likely to have the same semantic labels if they have similar appearance.

  • Multi-view coherence: The surface is photo-consistent and semantically consistent across multiple views.

  • Depth variation: The depth at spatially neighbouring pixels within an object varies smoothly for most of the surface (except internal depth discontinuities).

  • Long-term temporal coherence: The semantic labels on each object remain consistent over long time-frames in a sequence.

The cost function enforces spatial and temporal constraints on the semantics, appearance, motion and shape. Temporal semantic coherence is enforced using tracklets based on the dynamic object similarity \(S_{i,j}\) (Eq. 1). An example of multi-view semantic scene flow, segmentation and reconstruction is shown in Fig. 3c. Enforcing temporal coherence with semantic tracklets for a multi-view video reduces noise in the per-pixel labels. Errors in object segmentation remain due to the low spatial resolution of the initial semantic boundaries and visual ambiguity; these are addressed by combining information across multiple views. Joint optimisation of multiple view scene flow, co-segmentation and reconstruction minimises:

$$\begin{aligned} E(l,d,m) = \lambda _{d}E_{d}(d) + \lambda _{a}E_{a}(l) + \lambda _{c}E_{c}(l) + \lambda _{sm}E_{sm}(l,d) + \lambda _{s}E_{s}(l,d) + \lambda _{m}E_{m}(l,m) \end{aligned}$$
(6)

where d is the depth at each pixel, m is the motion and l is the semantic label. \(E_d()\) is the matching/depth cost, \(E_a()\) is the appearance/color cost, \(E_c()\) is the contrast cost, \(E_{sm}()\) is the semantic labelling cost, \(E_s()\) is the smoothness cost, and \(E_m()\) is the motion/flow cost. The individual cost terms enforce spatial and temporal coherence for dynamic objects in semantic labels, appearance, region boundary contrast and motion. The energy is minimised subject to a geodesic star-convexity constraint on the semantic labels l (Mustafa et al. 2016a):

$$\begin{aligned} \min _{(l,d,m),\; l \in S^{\star }({\mathscr {C}})} E(l,d,m) \Leftrightarrow \min _{(l,d,m)} E(l,d,m) + E^{\star }(l \mid x, {\mathscr {C}}) \end{aligned}$$
(7)

where \( S^{\star }({\mathscr {C}})\) is the set of all shapes which are geodesic star-convex with respect to the features in \({\mathscr {C}} = \left\{ c_{1},\ldots ,c_{n} \right\} \) within the initial semantic segmentation \({\mathscr {R}}\), and \(E^{\star }(l \mid x, {\mathscr {C}})\) is the geodesic star-convexity constraint enforced on the semantic labels l. \(\alpha \)-expansion is used to iterate through the set of labels in \({\mathscr {L}} \times {\mathscr {D}} \times {\mathscr {M}}\) (Boykov et al. 2001) and a solution is obtained using graph-cuts (Boykov and Kolmogorov 2004) across the spatial and temporal neighbourhoods shown in Fig. 4. The initially reconstructed surface \({\mathscr {R}}\) is updated by minimizing the energy in Eq. 6, estimating the depth, segmentation and motion at each pixel within the projection of region \({\mathscr {R}}\) in each view.
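Schematically, the joint objective of Eqs. 6 and 7 combines the weighted cost terms with the star-convexity penalty. The sketch below only evaluates the energy for one candidate assignment and stands in for the alpha-expansion/graph-cut solver; all names and the dictionary layout are assumptions for illustration.

```python
def total_energy(l, d, m, costs, lam, star_penalty=None):
    """Evaluate Eq. 6 (plus the geodesic star-convexity term of Eq. 7) for one
    candidate assignment (l, d, m) over a region.

    `costs` maps term names to callables and `lam` holds the weights; in the
    full method the minimisation is carried out with alpha-expansion / graph
    cuts over the joint label space L x D x M, not by direct evaluation.
    """
    e = (lam['d']  * costs['depth'](d) +
         lam['a']  * costs['appearance'](l) +
         lam['c']  * costs['contrast'](l) +
         lam['sm'] * costs['semantic'](l, d) +
         lam['s']  * costs['smooth'](l, d) +
         lam['m']  * costs['motion'](l, m))
    if star_penalty is not None:          # E*(l | x, C): large cost for labellings
        e += star_penalty(l)              # that are not geodesic star-convex
    return e
```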

Spatial neighbourhood: The spatial neighbourhood is defined as pairs of spatially close pixels in the image domain. A standard 8-connected spatial neighbourhood is used denoted by \(\psi _S\); the set of pixel pairs (pq) such that p and q belong to the same frame and are spatially connected.

Temporal neighbourhood: The temporal neighbourhood is defined based on the set of tracklets \(T_{i}\) generated for any frame i. Optical flow is used to compute a dense flow field to the tracklet frames, initialized from the sparse temporal SIFT feature correspondences. EpicFlow (Revaud et al. 2015) is used to preserve large displacements, as the tracklet frames are widely distributed in time, and forward-backward flow consistency is enforced. The optical flow vectors define the temporal neighbourhood \(\psi _T = \left\{ \left( p,q \right) \mid q = p + d_{i,j} \right\} \), where j indexes a frame in the tracklet \(T_{i} = \left\{ j=t_r \right\} \) and \(d_{i,j}\) is the displacement vector from image i to j.
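A minimal sketch of how the temporal neighbourhood \(\psi _T\) could be assembled from forward and backward flow fields with a consistency check; the flow layout (x then y displacement channels), the threshold and the names are assumptions rather than the paper's implementation.

```python
import numpy as np

def temporal_neighbours(flow_fwd, flow_bwd, max_err=1.0):
    """Build the temporal neighbourhood psi_T between frame i and a tracklet
    frame j, keeping only forward-backward consistent flow vectors.

    flow_fwd: HxWx2 displacement d_{i,j} (frame i -> j), channel 0 = x, 1 = y;
    flow_bwd: the reverse flow (j -> i). Returns pixel arrays p (in frame i)
    and q = p + d_{i,j} (in frame j).
    """
    h, w = flow_fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    qx = xs + flow_fwd[..., 0]
    qy = ys + flow_fwd[..., 1]
    # sample the backward flow at the (rounded) target location q
    qxi = np.clip(np.round(qx).astype(int), 0, w - 1)
    qyi = np.clip(np.round(qy).astype(int), 0, h - 1)
    back = flow_bwd[qyi, qxi]
    # consistent if following the flow forward then backward returns close to p
    err = np.hypot(flow_fwd[..., 0] + back[..., 0], flow_fwd[..., 1] + back[..., 1])
    valid = err < max_err
    p = np.stack([xs[valid], ys[valid]], axis=1)
    q = np.stack([qx[valid], qy[valid]], axis=1)
    return p, q
```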

Semantic cost \(E_{sm}(l,d)\): This term enforces multi-view consistency on the semantic label of each pixel p. Inconsistent labels across views are penalised to ensure semantic coherence. The cost is computed from the probability of the class labels at each pixel in the initial semantic segmentation (Chen et al. 2016). Unlike previous approaches to semantic coherence, we enforce spatial and temporal consistency using tracklets across the neighbourhoods. The term is defined as:

$$\begin{aligned} E_{sm}(l,d) = \sum _{p\in \psi _S} e_{sm}(p, d_{p}, l_{p}) \end{aligned}$$

\(e_{sm}(p, d_{p}, l_{p}) = \sum _{c=1}^{N_{K}} z(p,r, l_p) \) , if \(d_{p}\ne {\mathscr {U}}\) else a fixed cost \(S_{{\mathscr {U}}}\) is assigned. A 3D point \(P(p,d_p)\) is assumed along the optical ray passing through pixel p located at a distance \(d_{p}\) from the reference camera. The projection of hypothesized point \(P(p,d_p)\) in view c is defined by \(r = \phi _{c}(P)\). \(N_{K}\) is the total number of views in which point \(P(p,d_p)\) is visible.

$$\begin{aligned} z(p,r, l_p) =\left\{ \begin{array}{ll} -log P_{sem}(I_{p}|l_{p}) &{}\quad \text {if } l_{p} = l_{r}\\ -log \left( 1- P_{sem}(I_{p}|l_{p}) \right) &{}\quad \text {if } l_{p} \ne l_{r} \end{array}\right. \end{aligned}$$

where \(l_{r}\) is the semantic label at pixel r in view c and \(P_{sem}(I_{p}|l_{p} = l_i)\) denotes the probability of the semantic label \(l_i\) at pixel p in the classification image obtained from initial semantic segmentation.

Contrast cost \(E_{c}(l)\): The contrast cost (Chen et al. 2016) is modified to introduce spatial and temporal semantic coherence and to ensure that the region boundaries of dynamic objects have high contrast. Semantic region boundaries are propagated using the tracklets as a prior for the optimization:

$$\begin{aligned} E_{c}(l) = \sum _{p,q \in \psi _T} e_{c}(p,q,l_p,l_q,\sigma _{\alpha }^{t},\vartheta _{pq}^{t},\sigma _{\beta }^{t}) + \sum _{p,q \in \psi _S} e_{c}(p,q,l_p,l_q,\sigma _{\alpha }^{s},\vartheta _{pq}^{s},\sigma _{\beta }^{s}) \end{aligned}$$

$$\begin{aligned} e_{c}(p,q,l_p,l_q,\sigma _{\alpha },\vartheta _{pq},\sigma _{\beta }) = \mu \left( l_p,l_q \right) \left( \lambda _{ca} \exp \left( -\frac{\left\| B(p) - B(q) \right\| ^{2}}{2 \sigma _{\alpha }^{2}\,\vartheta _{pq}^{2}} \right) + \lambda _{cl} \exp \left( -\frac{\left\| L(p) - L(q) \right\| ^{2}}{2 \sigma _{\gamma }^{2}} \right) \right) \end{aligned}$$

where \( \mu \left( l_p,l_q \right) = 1 \text { if } l_{p} \ne l_{q} \text { and } 0 \text { otherwise}\), and \(\vartheta _{pq}\) is the Euclidean distance between pixels p and q. The first Gaussian kernel is a bilateral kernel which depends on the RGB color (B() is the bilateral-filtered image) and pixel positions, and the second kernel depends only on the pixel positions L. The parameters \(\sigma _{\alpha }\), \(\sigma _{\beta }\) and \(\sigma _{\gamma }\) control the scale of the Gaussian kernels. The first kernel forces pixels with similar color and position to have similar labels, while the second kernel considers only spatial proximity when enforcing smoothness. The value of \(\sigma _{\alpha } = \left\langle \frac{\left\| B(p) - B(q)\right\| ^{2}}{\vartheta _{pq}^{2}}\right\rangle \), with the operator \( \langle \rangle \) denoting the mean computed across the neighbourhoods \(\psi _S\) and \(\psi _T\) for spatially and temporally coherent contrast respectively.
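The pairwise contrast term can be read directly from the two kernels above; a sketch, assuming per-pixel bilateral-filtered colours and positions are available, with the default weights taken from the parameter settings reported in Sect. 4 and all names illustrative.

```python
import numpy as np

def contrast_term(Bp, Bq, pos_p, pos_q, lp, lq,
                  sigma_alpha, sigma_gamma, lam_ca=0.5, lam_cl=0.5):
    """Pairwise contrast cost e_c for one (p, q) pair.

    Bp, Bq: bilateral-filtered colours at p and q; pos_p, pos_q: pixel
    positions. The cost is only paid when the labels disagree (mu = 1).
    """
    if lp == lq:                                   # mu(l_p, l_q) = 0
        return 0.0
    dist_pq = np.linalg.norm(np.asarray(pos_p, float) - np.asarray(pos_q, float))
    bilateral = np.exp(-np.sum((np.asarray(Bp, float) - np.asarray(Bq, float)) ** 2)
                       / (2.0 * sigma_alpha ** 2 * dist_pq ** 2))
    spatial = np.exp(-dist_pq ** 2 / (2.0 * sigma_gamma ** 2))
    return lam_ca * bilateral + lam_cl * spatial
```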

Appearance cost \(E_{a}(l)\): This cost is computed using the negative log likelihood (Boykov and Kolmogorov 2004) of the color models learned for the foreground object and the background. The foreground models are learnt from the sparse features of the dynamic object in the current frame and the foreground regions of the tracklet frames to improve the consistency of the results. Static background models are learnt from the sparse features outside the initial semantic segmentation of the dynamic object in the current frame and the regions outside the semantic segmentation in the tracklet frames. The appearance cost is defined as:

$$\begin{aligned} E_{a}(l) = \sum _{p\in \psi _S} -log P(I_{p}|l_{p}) \end{aligned}$$

where \(P(I_{p}|l_{p} = l_i)\) is the probability of pixel p in the reference image belonging to label \(l_i\). The color models are GMMs with 10 components each for the foreground and background.
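A sketch of the colour models and the appearance cost using 10-component GMMs; scikit-learn is used here purely for illustration and is not necessarily what the authors used, and all function and variable names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_colour_models(fg_pixels, bg_pixels, n_components=10, seed=0):
    """Fit 10-component GMM colour models for foreground and background.
    fg_pixels, bg_pixels: Nx3 arrays of RGB samples gathered from the sparse
    features of the current frame and from the tracklet regions."""
    fg = GaussianMixture(n_components=n_components, random_state=seed).fit(fg_pixels)
    bg = GaussianMixture(n_components=n_components, random_state=seed).fit(bg_pixels)
    return fg, bg

def appearance_cost(colours, label, fg_gmm, bg_gmm):
    """Per-pixel appearance cost -log P(I_p | l_p): negative log-likelihood of
    the pixel colours under the colour model of the candidate label."""
    gmm = fg_gmm if label == 'foreground' else bg_gmm
    return -gmm.score_samples(np.asarray(colours, float))  # score_samples = log-likelihood
```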

Matching cost \(E_{d}(d)\): The photo-consistency matching cost across views is defined as:

$$\begin{aligned} E_{d}(d) = \sum _{p\in \psi _S} e_{d}(p, d_{p}) \end{aligned}$$

where \(e_{d}(p, d_{p}) = \sum _{i \in {\mathscr {O}}_{k}}m(p,r)\) if \(d_{p}\ne {\mathscr {U}}\), and \(M_{{\mathscr {U}}}\) otherwise. The matching score m(p,r) is inspired by Hu and Mordohai (2012), and \(M_{{\mathscr {U}}}\) is the fixed cost of labelling a pixel unknown. r denotes the projection of the hypothesised point P in an auxiliary camera, where P is a 3D point along the optical ray passing through pixel p at distance \(d_{p}\) from the reference camera. \({\mathscr {O}}_{k}\) is the set of the k most photo-consistent view pairs with the reference camera, identified using the highest number of feature matches spatially across frames.
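A rough sketch of the matching cost; since the exact matching score m(p, r) of Hu and Mordohai (2012) is not reproduced here, normalised cross-correlation over small patches is substituted as a stand-in, and all names are illustrative.

```python
import numpy as np

def patch_ncc_cost(ref_patch, aux_patch):
    """Stand-in matching score between the patch around p in the reference view
    and the patch around its projection r in an auxiliary view (NCC converted
    to a cost; the paper uses a score inspired by Hu and Mordohai 2012)."""
    a = np.asarray(ref_patch, float).ravel()
    b = np.asarray(aux_patch, float).ravel()
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return 1.0 - float(np.mean(a * b))          # lower = more photo-consistent

def depth_matching_cost(ref_patch, aux_patches, m_unknown, depth_known=True):
    """e_d(p, d_p): sum of matching scores over the k most photo-consistent
    auxiliary views O_k, or the fixed cost M_U when depth is labelled unknown."""
    if not depth_known:
        return m_unknown
    return sum(patch_ncc_cost(ref_patch, aux) for aux in aux_patches)
```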

Motion cost \(E_{m}(l,m)\): This adds the brightness constancy assumption to the cost function, generalized to the spatial and temporal neighbourhoods, defined as:

$$\begin{aligned} E_{m}(l,m) = \sum _{{p} \in \psi _T} \lambda _{l}E_{l}(p,m_p,l_p) + \lambda _{c}E_{c}(p,m_p,l_p) \end{aligned}$$

$$\begin{aligned} E_{l}(p,m_p,l_p) = \sum _{i = 1}^{N_{v}} \left\| I_{i}(p,t) - I_{i}(p+ m_p, t+1) \right\| ^{2} \quad \text {if } l_p=l_{p+m_p} \text { at } t \text { and } t+1, \text { else } 0 \end{aligned}$$

$$\begin{aligned} E_{c}(p,m_p,l_p) = \sum _{i = 2}^{N_{v}} \left\| I_{1}(p,t) - I_{i}(p+ m_p,t) \right\| ^{2} \quad \text {if } l_p=l_{p+m_p} \text { at } t, \text { else } 0 \end{aligned}$$

\(E_{l}()\) penalizes deviation from the brightness constancy assumption over time within a single view, and \(E_{c}()\) penalizes deviation from brightness constancy between the reference view and each of the other views. Here \(N_{v}\) is the number of views at each time frame, \(I_{i}(p,t)\) is the intensity at pixel p at time instant t in view i, and \(\psi _S\) and \(\psi _T\) are the spatial and temporal neighbourhoods.

The motion cost also constrains the flow vector \(m_p\) to lie within a window around the sparse constraint at p, forcing the dense flow to approximate the sparse 2D temporal correspondences.
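A literal per-pixel transcription of the two motion terms above, assuming integer flow candidates and label/image maps indexed as [y, x]; this is a sketch with illustrative names, not the paper's implementation.

```python
import numpy as np

def motion_cost_pixel(p, m_p, images_t, images_t1, labels_t, labels_t1,
                      lam_l=1.0, lam_c=1.0):
    """Motion cost for one pixel p with integer flow candidate m_p.

    images_t / images_t1: lists of view images at t and t+1 (index 0 is the
    reference view); labels_t / labels_t1: corresponding semantic label maps.
    Costs are only accrued when the semantic label is preserved along m_p.
    """
    x, y = p
    xq, yq = x + m_p[0], y + m_p[1]
    e_l, e_c = 0.0, 0.0
    for i, (img_t, img_t1) in enumerate(zip(images_t, images_t1)):
        if labels_t[i][y, x] == labels_t1[i][yq, xq]:
            # temporal brightness constancy within view i
            e_l += float(np.sum((np.asarray(img_t[y, x], float)
                                 - np.asarray(img_t1[yq, xq], float)) ** 2))
        if i > 0 and labels_t[0][y, x] == labels_t[i][yq, xq]:
            # cross-view brightness constancy at time t against the reference view
            e_c += float(np.sum((np.asarray(images_t[0][y, x], float)
                                 - np.asarray(img_t[yq, xq], float)) ** 2))
    return lam_l * e_l + lam_c * e_c
```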

Smoothness cost \(E_{s}(l,d)\): The surface smoothness cost introduced in Mustafa et al. (2016a) is extended to spatial and temporal neighbourhoods:

$$\begin{aligned} E_{s}(l,d) = \lambda _{s}^{T} \sum _{p,q \in \psi _T} e_{s}(l_p,d_{p},l_q,d_{q},d_{max}^{t}) + \lambda _{s}^{S} \sum _{p,q \in \psi _S} e_{s}(l_p,d_{p},l_q,d_{q},d_{max}^{s}) \end{aligned}$$

$$\begin{aligned} e_{s}(l_p,d_{p},l_q,d_{q},d_{max}) = {\left\{ \begin{array}{ll} \min (\left| d_{p} - d_{q} \right| , d_{max}), &{} \text {if } l_{p} = l_{q} \text { and } d_{p},d_{q}\ne {\mathscr {U}}\\ 0, &{} \text {if } l_{p} = l_{q} \text { and } d_{p},d_{q} = {\mathscr {U}}\\ d_{max}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

\(d_{max}\) is introduced to avoid over-penalising large discontinuities. \(d_{max}^{s}\) ensures spatial smoothness and \(d_{max}^{t}\) ensures smoothness over time between the temporal neighbourhood of the tracklets and is set to twice \(d_{max}^{s}\) to allow large movement in the object between tracklet frames.
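The pairwise smoothness term is a truncated depth difference gated by the labels; a minimal sketch, with the unknown depth \({\mathscr {U}}\) represented as None and the function name an assumption.

```python
def smoothness_term(lp, dp, lq, dq, d_max, unknown=None):
    """Pairwise smoothness cost e_s: truncated depth difference within an object,
    zero when both depths are unknown, and the maximum penalty otherwise
    (label disagreement or only one of the two depths unknown)."""
    if lp == lq:
        if dp is not unknown and dq is not unknown:
            return min(abs(dp - dq), d_max)
        if dp is unknown and dq is unknown:
            return 0.0
    return d_max
```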

Table 1 The characteristic properties of datasets used for evaluation

Semantically and Temporally Coherent Reconstruction

The estimated dense flow for each view is projected onto the 3D visible surface to establish dense 3D correspondence (scene flow) between frames and between semantic tracklet frames \(T_{i}\), giving a 4D semantically and temporally coherent dynamic scene reconstruction, as illustrated in Fig. 1. Temporal correspondence is first obtained for the view with maximum visibility of 3D points. To increase surface coverage, correspondences are added in order of visibility of 3D points for the remaining views. Dense temporal correspondence is propagated to new surface regions as they appear using the dense flow estimated from the joint refinement. Temporal coherence is also estimated between semantic tracklet frames to overcome the limitations of sequential correspondence propagation by correcting errors introduced in the semantically and temporally coherent reconstruction. As a result, along with segmentation and reconstruction of dynamic scenes, we obtain temporal and semantic per-pixel correspondence information in both 2D and 3D, as shown for the Juggler dataset in Fig. 7. The 2D per-view depth maps are combined using Poisson surface reconstruction (Kazhdan et al. 2006), which can lose fine detail in the object mesh compared to the semantic segmentation.

4 Results and Evaluation

Joint semantic co-segmentation, reconstruction and scene flow estimation (Sect. 3.3) is evaluated on a variety of publicly available multi-view indoor and outdoor dynamic scene datasets; details are given in Table 1.

Fig. 7

3D temporal alignment between frames for Juggler dataset

4.1 4D Flow Evaluation

We evaluate the semantic and temporal coherence obtained using the proposed 4D semantic flow algorithm on all of the datasets. Stable long-term 4D correspondence propagation is illustrated using color-coded results: the first frame of the sequence is color coded and the colors are propagated between frames using the 2D-3D motion information obtained from the joint refinement explained in Sect. 3.3. Results of the proposed 4D temporal and semantic alignment, illustrated in Fig. 8, show that the colour of the points remains consistent between frames. The proposed approach is qualitatively shown to propagate correspondences reliably for complex dynamic scenes with large non-rigid motion.

For comparative evaluation we use: (a) the state-of-the-art dense flow algorithm Deepflow (Weinzaepfel et al. 2013); (b) a recent algorithm for alignment of partial surfaces, 4DMatch (Mustafa et al. 2016b); and (c) Simpleflow (Tao et al. 2012). Qualitative results against 4DMatch, Deepflow and Simpleflow shown in Fig. 9 indicate that the propagated colour map does not remain consistent across the sequence for large motion compared to the proposed method (red regions indicate correspondence failure).

For quantitative evaluation we compute the silhouette overlap error (SOE). Dense correspondence over time is used to create a propagated mask for each image. The propagated mask is overlapped with the silhouette of the projected surface reconstruction at each frame, over the N frames and M views, to evaluate the accuracy of the dense propagation. The error is defined as:

$$\begin{aligned} SOE = \frac{1}{M N}\sum _{i = 1}^{N}\sum _{c = 1}^{M} \frac{\text {Area of intersection}}{\text {Area of back-projected mask}} \end{aligned}$$
(8)
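A direct transcription of Eq. 8, assuming binary propagated masks and reconstruction silhouettes per frame and view; the data layout and function name are assumptions.

```python
import numpy as np

def silhouette_overlap_error(propagated_masks, silhouette_masks):
    """Silhouette overlap measure of Eq. 8, averaged over the N frames and M views.

    propagated_masks[i][c] / silhouette_masks[i][c]: HxW boolean masks for frame i
    and view c (the mask propagated by the dense correspondence and the silhouette
    of the projected reconstruction, respectively).
    """
    total, count = 0.0, 0
    for frame_prop, frame_sil in zip(propagated_masks, silhouette_masks):
        for prop, sil in zip(frame_prop, frame_sil):
            area = float(np.count_nonzero(prop))              # back-projected mask area
            inter = float(np.count_nonzero(prop & sil))       # overlap with silhouette
            total += inter / area if area > 0 else 0.0
            count += 1
    return total / count if count else 0.0
```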

Evaluation against the different techniques is shown in Table 2 for all datasets. The silhouette overlap error is lowest for the proposed approach, showing relatively high accuracy.

Fig. 8

Temporal and semantic coherence results using proposed approach on Handshake, Lightfield and Breakdance datasets. Color-coding for temporal coherence: Unique gradient colors are assigned to first frame of the sequence for each object. Color-coding for semantic coherence: head is red, left-arm is blue, right-arm is green, left-leg is pink and right-leg is violet. Colors are propagated using proposed 4D scene flow (Color figure online)

We evaluate temporal coherence across the Magician sequence by measuring the variation in appearance of each scene point between frames and between semantic tracklet frames for state-of-the-art methods, defined as: \(\sqrt{\frac{\Delta r^{2} + \Delta g^{2} + \Delta b^{2}}{3}}\), where \(\Delta \) is the difference operator. The evaluation in Table 3 against state-of-the-art methods demonstrates the stability of long-term temporal tracking for the proposed joint semantic scene flow, co-segmentation and reconstruction.
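The appearance-variation score can be computed per tracked scene point and averaged; a small sketch, assuming the two RGB observations of each point are stacked in Nx3 arrays, with illustrative names.

```python
import numpy as np

def appearance_variation(rgb_a, rgb_b):
    """Temporal coherence score: sqrt((dr^2 + dg^2 + db^2) / 3) per scene point,
    averaged over all tracked points. rgb_a, rgb_b: Nx3 arrays holding the two
    observations (e.g. at frame t and a tracklet frame) of each point."""
    diff = np.asarray(rgb_a, float) - np.asarray(rgb_b, float)
    per_point = np.sqrt(np.mean(diff ** 2, axis=1))
    return float(per_point.mean())
```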

Fig. 9

Dense flow comparison results on different dynamic sequences

Table 2 Silhouette overlap error for multi-view datasets for flow evaluation, where SF is Simpleflow
Table 3 Temporal coherence evaluation for Magician dataset against existing methods
Fig. 10

Comparison of segmentation on dynamic datasets from Kim et al. (2012) and Djelouah et al. (2016) against MVVS (Djelouah et al. 2016)

Fig. 11

Ground-truth segmentation comparison with TcMVS (Mustafa et al. 2016a) on multi-view datasets

4.2 Segmentation Evaluation

Multi-view co-segmentation is evaluated against a variety of state-of-the-art methods:

(a) Non-Semantic methods: Multi-view segmentation (MVVS) (Djelouah et al. 2016), Joint segmentation and reconstruction (TcMVS) (Mustafa et al. 2016a), and

(b) Semantic methods: Semantic co-segmentation in videos (SCV) (Zhong and Yang 2016), Mask RCNN (He et al. 2017) and Conditional random field as recurrent neural networks (CRF-RNN) (Zheng et al. 2015).

The proposed segmentation is also evaluated against the single-view segmentation methods MVC (Chiu and Fritz 2013) and ObMiC (Fu et al. 2014), applied independently on each view for comparison. Comparison against MVVS (Djelouah et al. 2016) is shown in Fig. 10, and evaluation against TcMVS (Mustafa et al. 2016a), SCV (Zhong and Yang 2016) and CRF-RNN (Zheng et al. 2015) is shown in Fig. 12 for dynamic datasets. Ground-truth segmentation comparison with TcMVS (Mustafa et al. 2016a) is shown in Fig. 11. Quantitative evaluation against state-of-the-art methods is measured by intersection-over-union with ground-truth, shown in Table 4. Ground-truth is available online for most of the datasets and was obtained by manual labelling for the others. The proposed semantically coherent joint multi-view 4D flow, co-segmentation and reconstruction achieves the best segmentation performance against ground-truth for all datasets tested. Results presented in Fig. 12 indicate that the proposed approach accurately segments fine detail such as hands and feet where other approaches are unreliable.

Comparison with SCV (Zhong and Yang 2016): Comparison to Zhong and Yang (2016) is shown in Fig. 12 and Table 4. The results show that the proposed approach achieves a significant improvement for multi-view video segmentation compared to the co-segmentation approach using tracklets (Zhong and Yang 2016) (average 45% improvement in intersection-over-union of the segmentation vs. ground-truth).

Table 4 Segmentation comparison against state-of-the-art methods using the intersection-over-union metric
Fig. 12

Comparison of segmentation on public datasets against state-of-the-art methods: TcMVS (Mustafa et al. 2016a) (region in red represents region missing from ground-truth and green represents region not present in ground-truth), CRF-RNN (Zheng et al. 2015) and SCV (Zhong and Yang 2016) (Color figure online)

4.3 Reconstruction Evaluation

The reconstruction results obtained from the proposed approach are compared against state-of-the-art approaches in joint segmentation and reconstruction (TcMVS, Mustafa et al. 2016a) and multi-view stereo (Colmap, Schönberger et al. 2016; MVE, Semerjian 2014; SMVS, Langguth et al. 2016). MVE, SMVS and Colmap are state-of-the-art multi-view stereo techniques which do not refine the segmentation. All methods are initialized with the same initial semantic reconstruction (Sect. 3.1) for fair comparison. The comparison of reconstructions in Fig. 13 demonstrates that the proposed method gives consistently more complete and accurate models. Figure 14 presents a comparison to a statistical model-based approach, MBR (Rhodin et al. 2016), which reconstructs a single human body shape from the whole sequence together with the pose at each frame. This provides a good estimate of the underlying body shape but does not take clothing into account, resulting in inaccurate silhouette overlap. Comparison of full scene reconstruction against MVE and SMVS is shown in Fig. 15, showing improved completeness and accuracy.

Fig. 13

Comparison of reconstruction of dynamic objects against Colmap (Schönberger et al. 2016), MVE (Semerjian 2014), SMVS (Langguth et al. 2016) and TcMVS (Mustafa et al. 2016a) (same semantic labels are assigned to all methods for fair comparison)

Fig. 14

Comparison of reconstruction against MBR (Rhodin et al. 2016) from 4 views of falling down (Kim et al. 2012) dataset

Fig. 15

Comparison of full scene reconstruction against SMVS (Langguth et al. 2016) and MVE (Semerjian 2014) (same semantic labels are assigned to all the approaches for fair comparison)

Joint semantic 4D scene flow, co-segmentation and reconstruction results in a 3D model for which every surface point has consistent surface labelling across all views and over time. To illustrate the semantic wide-timeframe coherence achieved using the proposed approach unique colors are assigned to human body parts in one frame and the colors are propagated using the estimated temporal coherence. The color in different parts of the object remains consistent over time as shown in Fig. 8.

Parameters: Results are insensitive to parameter setting for all indoor and outdoor scenes. Table 5 shows the parameters used, with constant contrast cost \(\lambda _{ca}=\lambda _{cl}=0.5\) and smoothness cost \(\lambda _{s}^{S}=0.4\), \(\lambda _{s}^{T} = 0.6\).

Limitations: The proposed approach depends on an initial semantic labelling of the scene for each view obtained using Mask-RCNN. Gross errors or mislabeling may be propagated, resulting in incorrect semantic reconstruction, such as the soft toys labelled as people on the left-hand side of the Odzemok dataset (Fig. 2). Whilst enforcing semantic coherence is demonstrated to improve scene flow, segmentation and reconstruction for a wide variety of scenes, visual ambiguity in appearance and occlusion may degrade performance.

Table 5 Parameters for all datasets

5 Conclusion

This paper proposes a novel approach to joint semantic 4D scene flow, multi-view co-segmentation and reconstruction of complex dynamic scenes. Temporal and semantic coherence is enforced over long time-frames by semantic tracklets which identify similar frames using semantic label, appearance, shape and motion information. Tracklets are used for long-term learning to constrain the per-frame flow and co-segmentation optimization on general dynamic scenes. Joint optimization simultaneously improves the scene flow, semantic segmentation and reconstruction of the scene by enforcing semantic coherence both spatially across views and temporally across widely spaced similar frames. Comparative evaluation demonstrates that enforcing semantic coherence achieves a significant improvement in scene flow and segmentation of general dynamic indoor and outdoor scenes captured with multiple hand-held cameras. Introduction of space-time semantic coherence in the proposed framework achieves better reconstruction and flow estimation than state-of-the-art methods.