Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors

Representing scenes at the granularity of objects is a prerequisite for scene understanding and decision making. We propose PriSMONet, a novel approach based on Prior Shape knowledge for learning Multi-Object 3D scene decomposition and representations from single images. Our approach learns to decompose images of synthetic scenes with multiple objects on a planar surface into their constituent scene objects and to infer their 3D properties from a single view. A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image. By differentiable rendering, we train our model to decompose scenes from RGB-D images in a self-supervised way. The 3D shapes are represented continuously in function-space as signed distance functions which we pre-train from example shapes in a supervised way. These shape priors provide weak supervision signals to better condition the challenging overall learning task. We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.


Introduction
Humans have the remarkable capability to decompose scenes into their constituent objects and to infer object properties such as 3D shape and texture from just a single view. Providing intelligent systems with similar capabilities is a long-standing goal in artificial intelligence. Such representations would facilitate object-level description, abstract reasoning and high-level decision making. Moreover, object-level scene representations could improve generalization for learning in downstream tasks such as robust object recognition or action planning. Learning single-image 3D scene decomposition in a self-supervised way is specifically challenging due to common ambiguities with respect to depth, 3D object pose, shape, texture and lighting, for which suitable priors are required.
Previous work on learning-based scene representations focused on single-object scenes (Sitzmann et al., 2019), did not consider the underlying compositional structure of scenes (Eslami et al., 2018; Mildenhall et al., 2020), or neglected to model the 3D geometry of the scene and the objects explicitly with an interpretable representation (e.g. Greff et al., 2019; Eslami et al., 2016; Stelzner et al., 2021). In our work, we make steps towards multi-object representations by proposing a network which learns to decompose scenes into objects through weak and self-supervision, and represents 3D shape, texture, and pose of objects explicitly. By this, our approach jointly addresses the tasks of object detection, instance segmentation, object pose estimation and inference of 3D shape and texture in single RGB images. We incorporate prior shape knowledge in the form of pretrained neural implicit shape models to allow for learning of scene decomposition into an interpretable 3D representation through weak supervision.
Inspired by (Park et al., 2019; Oechsle et al., 2019), we represent 3D object shape and texture continuously in function-space as signed distance and color values at continuous 3D locations. The scene representation network infers the object poses and their shape and texture encodings from the input RGB image. We use a differentiable renderer which efficiently generates color and depth images as well as instance masks from the object-wise scene representation. This allows for training our scene representation network in a weakly-supervised way.

Fig. 1. Example scenes with object manipulation (e.g. switching the positions of two objects). For each example, we input the left image and compute the middle one as the standard reconstruction. After the manipulation in the latent space, we obtain the respective right image. Plausible new scene configurations are shown on the Clevr dataset (Johnson et al., 2017) (top) and on composed ShapeNet models (Chang et al., 2015) (bottom).
Using a pre-trained shape space, we train our model to decompose and describe the scene from single RGB-D images without further annotations like instance segmentation, object poses, texture, or concrete shape. Due to the combination of object-level scene understanding and differentiable rendering, our model further facilitates generating new scenes by altering an interpretable latent representation (see Fig. 1).
We evaluate our approach on both synthetic and real scene datasets with images composed of multiple objects on a planar background. We show its capabilities with shapes such as geometric primitives and vehicles, and demonstrate the properties of our geometric and weakly-supervised learning approach for scene representation.
In summary, we make the following contributions: (1) We propose PriSMONet, a novel model to learn representations of scenes composed of multiple objects with a planar background. Our model describes the scene by explicitly encoding object poses, 3D shapes and texture.
(2) Our model is trained via differentiable rendering to decode the latent representation back into images. We apply a differentiable renderer using sampling-based raycasting for deep SDF shape embeddings which renders color and depth images as well as instance segmentation masks. This setup enables our model to be trained using only weak supervision in the form of shape priors and eliminates the need for scene-specific object-wise 3D supervision. (3) By representing 3D geometry explicitly, our approach naturally respects occlusions between objects and facilitates manipulation of the scene within the latent space. We demonstrate properties of our geometric model for scene representation and augmentation, and discuss advantages over multi-object scene representation methods which model 3D geometry implicitly.
To the best of our knowledge, our approach is the first to jointly learn object instance detection, instance segmentation, object localization, and inference of 3D shape and texture in a single RGB image via weak and self-supervised scene decomposition. For our current model, we make several assumptions and simplifications to provide insights for this challenging task and to allow for an in-depth evaluation of the applied strategies. In particular, we train and test our model on synthetic scenes with uniformly colored, planar background, and simplified lighting conditions. We also test our model trained with synthetic data on real images. We provide a discussion about current limitations of our model and possible directions for future research in Sec. 4.4.

Related Work
Deep learning of single object geometry. Several recent 3D learning approaches represent single object geometry by implicit surfaces of occupancy or signed distance functions which are discretized in 3D voxel grids (Kar et al., 2017; Tulsiani et al., 2017; Wu et al., 2016; Gadelha et al., 2017; Qi et al., 2016; Jimenez Rezende et al., 2016; Choy et al., 2016; Shin et al., 2018; Xie et al., 2019). Voxel representations typically waste significant memory and computation resources in empty scene parts. This limits their resolution and their ability to represent fine details. Other methods represent shapes with point clouds (Qi et al., 2017; Achlioptas et al., 2018), meshes (Groueix et al., 2018), deformations of shape primitives (Henderson and Ferrari, 2019) or multiple views (Tatarchenko et al., 2016). In continuous representations, neural networks are trained to directly predict signed distance (Park et al., 2019; Xu et al., 2019; Sitzmann et al., 2019), occupancy (Chen and Zhang, 2019), or texture (Oechsle et al., 2019) at continuous query points. We use such representations for individual objects.

Deep learning of multi-object scene representations. Self-supervised learning of multi-object scene representations from images recently gained significant attention in the machine learning community. MONet presents a multi-object network which decomposes the scene using a recurrent attention network and an object-wise autoencoder. It embeds images into object-wise latent representations and overlays them into images with a neural decoder. Several follow-up works improve upon this approach. (Greff et al., 2019) use iterative variational inference to optimize object-wise latent representations using a recurrent neural network. SPAIR (Crawford and Pineau, 2019) and SPACE (Lin et al., 2020) extend the attend-infer-repeat approach (Eslami et al., 2016) by laying a grid over the image and estimating the presence, relative position, and latent representation of objects in each cell. In GENESIS (Engelcke et al., 2020), the image is recurrently encoded into latent codes per object in a variational framework. (Locatello et al., 2020) propose Slot Attention for decomposing scenes into objects. In contrast to our method, the above methods do not represent the 3D geometry of the scene explicitly.
Related to our approach are also generative models like (Liao et al., 2020; Nguyen-Phuoc et al., 2020) which generate novel 3D scenes but do not explain input views like we do. GIRAFFE (Niemeyer and Geiger, 2021) proposes a generative model for scene composition based on neural radiance fields (NeRF) (Mildenhall et al., 2020) and appearance latents of objects. Different from ours, the method does not decompose images into 3D object descriptions. Recently, (Stelzner et al., 2021) decompose a scene into objects using Slot Attention and condition a NeRF-based decoder on a latent code to vary object shape and appearance. Their model encodes object position and rotation only implicitly and does not provide an explicit, interpretable 3D parametrization like our method. Other methods exploit multiple images to describe 3D scenes (Henderson and Lampert, 2020; Li et al., 2020). Scene decomposition in 3D from a single view, however, is significantly more difficult and requires certain assumptions like prior shape knowledge to be trainable in a self-supervised way.
Supervised learning for object instance segmentation, pose and shape estimation. Loosely related are supervised methods that segment object instances (Ren et al., 2015; Redmon et al., 2016; Hou et al., 2019), estimate their poses (Xiang et al., 2017) or recover their 3D shape (Gkioxari et al., 2019; Kniaz et al., 2020). In Mesh R-CNN (Gkioxari et al., 2019), objects are detected in bounding boxes and a 3D mesh is predicted for each object. The method is trained in a supervised fashion on images with annotated ground-truth object shapes. In contrast to all of these methods, ours is trained without ground-truth annotations of object poses, segmentation masks, or appearance, which our model learns with only weak supervision.

Method
We propose an autoencoder architecture which embeds images into object-wise scene representations (see Fig. 2 for an overview). Each object is explicitly described by its 3D pose and latent embeddings for both its shape and textural appearance. Given the object-wise scene description, a decoder composes the images back from the latent representation through differentiable rendering. We train our autoencoder-like network in a self-supervised way from RGB-D images.
Scene Encoding. The network infers a latent z = (z_1, . . . , z_N, z_bg) which decomposes the scene into object latents z_i ∈ R^d, i ∈ {1, . . . , N} and a background component z_bg ∈ R^{d_bg}, where d, d_bg are the dimensionality of the object and background encodings and N is the object count. Objects are sequentially encoded by a deep neural network z_i = g_o(I, ∆I_1:i−1, M_1:i−1) (see Fig. 2). We share the same object encoder network and weights between all objects. To guide the encoder to regress the latent representation of one object after the other, we forward additional information about already reconstructed objects. Specifically, we decode the previous object latents into object composition images, depth images and occlusion masks (I_1:i−1, D_1:i−1, M_1:i−1) := F(z_bg, z_1, . . . , z_i−1). They are generated by F using differentiable rendering which we detail in the subsequent paragraph. We concatenate the input image I with the difference image ∆I_1:i−1 := I − I_1:i−1 and the occlusion masks M_1:i−1, and input this to the encoder for inferring the representation of object i.
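This sequential encoding can be summarized with the minimal PyTorch-style sketch below; the layer configuration, tensor shapes and the renderer interface are illustrative assumptions rather than the exact architecture used in the paper (see the supplementary material for the actual parameters).

```python
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Shared CNN g_o; inputs: RGB image, difference image, occlusion mask (assumed 64x64)."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3 + 3 + 1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.fc = nn.Linear(64 * 16 * 16, latent_dim)

    def forward(self, image, diff, mask):
        # mask is expected as (B, 1, H, W), image and diff as (B, 3, H, W)
        x = torch.cat([image, diff, mask], dim=1)
        return self.fc(self.conv(x))

def encode_scene(image, encoder, renderer, z_bg, num_objects):
    """Encode objects one after the other, conditioning on what is explained so far."""
    latents = []
    for _ in range(num_objects):
        # Render the scene composed from the objects found so far (F in the paper).
        comp_rgb, comp_depth, comp_mask = renderer(z_bg, latents)
        diff = image - comp_rgb                      # unexplained image content
        latents.append(encoder(image, diff, comp_mask))
    return latents
```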
The object encoding z_i = (z_i,sh, z_i,tex, z_i,ext) decomposes into encodings for shape z_i,sh, textural appearance z_i,tex, and 3D extrinsics z_i,ext (see Fig. 3). The shape encoding z_i,sh ∈ R^{D_sh} parametrizes the 3D shape represented by a DeepSDF auto-decoder (Park et al., 2019). Similarly, the texture is encoded in a latent vector z_i,tex ∈ R^{D_tex} which is used by the decoder to obtain color values for each pixel that observes the object. Object position p_i = (x_i, y_i, z_i), orientation θ_i and scale s_i are regressed with the extrinsics encoding z_i,ext = (p_i, z_cos,i, z_sin,i, s_i).
The object pose is parametrized in a world coordinate frame with known transformation T^w_c from the camera frame. We assume the objects are placed upright and model rotations around the vertical axis with angle θ_i = arctan(z_sin,i, z_cos,i) and corresponding rotation matrix R_i. We use a two-parameter representation for the angle as suggested in prior work.

Fig. 3. Object-wise encoding and rendering. We feed the input image, scene composition images and masks of the previously found objects to an object encoder network g_o which regresses the encoding of the next object z_i. The object encoding decomposes into shape z_i,sh, extrinsics z_i,ext and texture latents z_i,tex. The shape latent parametrizes an SDF function network Φ which we use in combination with the pose and scale of the object encoded in z_i,ext for raycasting the object depth and mask using our differentiable renderer f. Finally, the color of the pixels is found with a texture function network Ψ parametrized by the texture latent.
We scale the object shape by the factor s_i ∈ [s_min, s_max], which we limit to an appropriate range using a sigmoid squashing function. The background encoder g_bg regresses the background encoding z_bg ∈ R^{d_bg}, i.e. the uniform color of the background plane with d_bg = 3. We assume the plane extrinsics and hence its depth image to be known in our experiments.

Scene Decoding. Given our object-wise scene representation, we use differentiable rendering to generate individual images of objects based on their geometry and appearance and compose them into scene images. An object-wise renderer (I_i, D_i, M_i) := f(z_i) determines the color image I_i, depth image D_i and occlusion mask M_i from each object encoding independently (see Fig. 3). The renderer determines the depth at each pixel u ∈ R² (in normalized image coordinates) through raycasting in the SDF shape representation. Inspired by (Wang et al., 2020), we trace the SDF zero-crossing along the ray by sampling points x_j := (d_j u, d_j) at equal intervals d_j := d_0 + j∆d, j ∈ {0, . . . , N − 1}, with start depth d_0. The points are transformed to the object coordinate system by T^o_c(z_i,ext) := T^o_w(z_i,ext) T^w_c. Subsequently, the signed distance φ_j to the shape at these transformed points is obtained by evaluating the SDF function network Φ(z_i,sh, T^o_c(z_i,ext) x_j). Note that the SDF network is also parametrized by the inferred shape latent of the object. The algorithm finds the zero-crossing at the first pair of samples with a sign change of the SDF Φ. The sub-discretization-accurate location x(u) of the surface is found through linear interpolation of the depth with respect to the corresponding SDF values of these points. The depth at a pixel D_i(u) is given by the z-coordinate of the raycasted point x(u) on the object surface in camera coordinates. If no zero-crossing is found, the depth is set to a large constant. The binary occlusion mask M_i(u) is set to 1 if a zero-crossing is found at the pixel and 0 otherwise. The pixel color I_i(u) is determined using a decoder network Ψ similar to Φ which receives the texture latent z_i,tex of the object and the raycasted 3D point x(u) in object coordinates as inputs and outputs an RGB value, i.e. I_i(u) = Ψ(z_i,tex, T^o_c(z_i,ext) x(u)). Note that although object masks are binary and only specify at which pixels color and depth have been rendered for an object, the gradients flow through the rendered depth and colors.
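A minimal sketch of this sampling-based raycasting is given below. It assumes an SDF network phi(z_sh, points) and rays that are already transformed into the object frame; the step count, names and the handling of missed rays are illustrative assumptions, not the reference implementation.

```python
import torch

def raycast_sdf(phi, z_sh, ray_dirs_obj, origin_obj,
                d0=0.5, delta=0.25, num_steps=12, far_depth=20.0):
    """Sample points along each ray, find the first SDF sign change and
    refine the hit depth by linear interpolation (differentiable throughout).

    phi:           SDF network, phi(z_sh, points) -> signed distances
    ray_dirs_obj:  (R, 3) unit ray directions in object coordinates
    origin_obj:    (3,)   camera origin in object coordinates
    """
    depths = d0 + delta * torch.arange(num_steps, dtype=torch.float32)   # (S,)
    points = origin_obj + depths[None, :, None] * ray_dirs_obj[:, None]  # (R, S, 3)
    sdf = phi(z_sh, points.reshape(-1, 3)).reshape(points.shape[:2])     # (R, S)

    # First consecutive pair of samples with a sign change (outside -> inside).
    sign_change = (sdf[:, :-1] > 0) & (sdf[:, 1:] <= 0)                  # (R, S-1)
    hit = sign_change.any(dim=1)
    idx = sign_change.float().argmax(dim=1)          # index of first sign change (0 if none)

    s0 = sdf.gather(1, idx[:, None]).squeeze(1)
    s1 = sdf.gather(1, (idx + 1)[:, None]).squeeze(1)
    d_lo, d_hi = depths[idx], depths[idx + 1]
    t = s0 / (s0 - s1 + 1e-8)                        # linear interpolation weight of the zero crossing
    hit_depth = torch.where(hit, d_lo + t * (d_hi - d_lo),
                            torch.full_like(d_lo, far_depth))            # large constant if missed
    mask = hit.float()
    return hit_depth, mask
```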
We speed up the raycasting process by only considering pixels that lie within the projected 3D bounding box of the object shape representation. This bounding box is known since the SDF function network is trained with meshes that are normalized to fit into a unit cube with a constant padding. Note that this rendering procedure is implemented using differentiable operations, making it fully differentiable with respect to the shape, color and extrinsics encodings of the object. The scene images, depth images and occlusion masks I_1:n, D_1:n, M_1:n = F(z_bg, z_1, . . . , z_n) are composed from the individual objects 1, . . . , n with n ≤ N and the decoded background through z-buffering. We initialize them with the background color, the depth image of the empty plane and an empty mask. Recall that the background color is regressed by the encoder network. For each pixel u, we search the occluding object i with the smallest depth at the pixel. If such an object exists, we set the pixel's values in I_1:N, D_1:N, M_1:N to the corresponding values in the object images and masks.

Training. We use pre-trained deep SDF models as a shape prior in our approach which were trained from a collection of meshes from different object categories similar to (Park et al., 2019). Note that the pre-trained shape space of multiple object categories is a very weak prior for object detection and object-wise scene decomposition, which our model learns in a self-supervised manner. Our multi-object network is trained from RGB-D images containing example scenes composed of multiple objects. To this end, we minimize a total loss function which is a weighted sum of the sub-loss functions L_I, L_D, L_sh, and L_gr (with weights λ_I, λ_depth, λ_sh, λ_gr). In particular, L_I is the mean squared error on the image reconstruction with Ω being the set of image pixels and I_gt the ground-truth color image. The depth reconstruction loss L_D penalizes deviations from the ground-truth depth D_gt. We apply Gaussian smoothing G(·) to spread the gradients over the rendered image. We decrease the standard deviation over time to allow the network to learn to decompose the scene in a coarse-to-fine manner. L_sh regularizes the shape encoding to stay within the training regime of the SDF network. Lastly, L_gr favors objects to reside above the ground plane; it is based on the z-coordinate z_i of the object in the world frame, its projection onto the ground plane, and the object SDF φ_i. The shape regularization loss is scheduled with time-dependent weighting. This prevents the network from learning to generate unreasonably extrapolated shapes in the initial phases of the training, but lets the network refine them over time.

Fig. 2. Overview of our approach on an example from the Clevr dataset (Johnson et al., 2017). Our multi-object scene representation decouples objects from the background and assigns object-wise instance label, geometry, appearance, and pose.
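The per-pixel z-buffer compositing described above can be sketched as follows in PyTorch; tensor layouts, the use of an infinite depth for missed rays, and the helper's name are our own assumptions rather than the reference implementation.

```python
import torch

def compose_scene(bg_color, bg_depth, obj_rgbs, obj_depths, obj_masks):
    """Compose per-object renderings into one image by z-buffering.

    bg_color:   (3,)        uniform background color (regressed by the encoder)
    bg_depth:   (H, W)      depth of the empty ground plane
    obj_rgbs:   (N, 3, H, W)
    obj_depths: (N, H, W)   large constant where an object was not hit
    obj_masks:  (N, H, W)   1 where a zero-crossing was found
    """
    H, W = bg_depth.shape
    rgb = bg_color[:, None, None].expand(3, H, W).clone()
    depth = bg_depth.clone()
    mask = torch.zeros(H, W)

    # Only consider object depths where the object was actually hit.
    masked_depths = torch.where(obj_masks.bool(), obj_depths,
                                torch.full_like(obj_depths, float('inf')))
    nearest_depth, nearest_idx = masked_depths.min(dim=0)      # per-pixel closest object
    occluding = nearest_depth < depth                          # object in front of the background

    idx = nearest_idx[None].expand(3, H, W)                    # gather the winning object's color
    rgb = torch.where(occluding[None], obj_rgbs.gather(0, idx[None])[0], rgb)
    depth = torch.where(occluding, nearest_depth, depth)
    mask = torch.where(occluding, torch.ones_like(mask), mask)
    return rgb, depth, mask
```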
We use a CNN for both the object and the background encoder. Both consist of multiple convolutional layers with kernel size (3, 3) and strides (1, 1) each followed by ReLU activations and (2, 2) max-pooling. The subsequent fully-connected layers yield the encodings for objects and background. Similar to (Park et al., 2019), we use multi-layer fully-connected neural networks for the shape decoder Φ and texture decoder Ψ. Further details are provided in the supplementary material.
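As a rough illustration of such a function-space decoder, a minimal DeepSDF-style MLP could look as follows; the layer sizes are placeholders and not the configuration from Tab. B.5. The texture decoder Ψ is analogous but outputs three color channels.

```python
import torch
import torch.nn as nn

class SDFDecoder(nn.Module):
    """Minimal DeepSDF-style decoder Φ: (shape latent, 3D point) -> signed distance."""
    def __init__(self, latent_dim=8, hidden=128, num_layers=4):
        super().__init__()
        layers, in_dim = [], latent_dim + 3
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(hidden, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, z_sh, points):
        # points: (P, 3); the shape latent z_sh is broadcast to every query point
        z = z_sh[None].expand(points.shape[0], -1)
        return self.mlp(torch.cat([z, points], dim=-1)).squeeze(-1)
```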

Experiments
Datasets. We provide extensive evaluation of our approach using synthetic scenes based on the Clevr dataset (Johnson et al., 2017) and scenes generated with ShapeNet models (Chang et al., 2015). The Clevr-based scenes contain images with a varying number of colored shape primitives (spheres, cylinders, cubes) on a planar single-colored background. We modify the data generation of Clevr in a number of aspects: (1) We remove shadows and additional light sources and only use the Lambertian rubber material for the objects' surfaces, as our decoder is by design not able to generate shadows. (2) To increase shape variety, we apply random scaling along the principal axes of the primitives. (3) An object might be completely hidden behind another one. Hence, the network needs to learn to occasionally hide objects. We generate several multi-object datasets. Each dataset contains scenes with a specific number of objects which we choose from two to five. Each dataset consists of 12.5K images with a size of 64×64 pixels. Objects are randomly rotated and placed in a range of [−1.5, 1.5]² on the ground plane while ensuring that any two objects do not intersect. In addition to the RGB images, we also generate depth maps for training as well as instance masks for evaluation. The images are split into subsets of (9K/1K/2.5K) examples for training, validation, and testing. For the pre-training of the DeepSDF (Park et al., 2019) network, we generate a small set of nine shapes per category with different scaling along the axes, for which we generate ground truth SDF samples. Different from (Park et al., 2019), we sample a higher ratio of points randomly in the unit cube instead of close to the surface. We also evaluate on scenes depicting either cars or armchairs as well as a mixed set consisting of mugs, bottles and cans (tabletop) from the ShapeNet model set. Specifically, we select 25 models per setting which we use both for pre-training the DeepSDF as well as for the generation of the multi-object datasets. We render (18K/2K/5K) images per object category. For additional evaluation, we further rendered a multi-object test set using 25 previously unseen models.

Network Parameters. For the Clevr / ShapeNet datasets, the object latent dimension is set to D_sh = 8/16 and D_tex = 7/15. The shape decoder is pre-trained for 10K epochs. We linearly decrease the loss weight λ_sh from 0.025/0.1 to 0.0025/0.01 during the first 500K iterations. The remaining weights are fixed to λ_I = 1.0, λ_depth = 0.1/0.05, λ_gr = 0.01. We add Gaussian noise to the input RGB images and clip depth maps at a distance of 12. The renderer evaluates at 12 steps per ray. Gaussian smoothing is applied with kernel size 16 and linearly decreasing sigma from 16/3 to 1/2 in 250K steps. We trained models with the Adam optimizer (Kingma and Ba, 2014), learning rate 10⁻⁴, and batch size 8 for 500/400 epochs. Training on the Clevr dataset with 3 objects takes about 2 days on an RTX 2080 Ti.

Evaluation Metrics. We evaluate the learning of object-level 3D scene representations using measures for instance segmentation, image reconstruction, and pose estimation. To evaluate our model's capability to recognize objects that best explain the input image, we consider established instance segmentation metrics. An object is counted as correctly segmented if the intersection-over-union (IoU) score between ground truth and predicted mask is higher than a threshold τ.
To account for occlusions, only objects that occupy at least 25 pixels are taken into account. We report average precision (AP_0.5), average recall (AR_0.5), and the F1_0.5 score for a fixed τ = 0.5 over all scenes, as well as the mean AP over thresholds in the range [0.5, 0.95] with step size 0.05, similar to (Everingham et al., 2010). We further list the ratio of scenes where all visible objects were found w.r.t. τ = 0.5 (allObj).
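For reference, the matching underlying these scores could be implemented roughly as follows. This is a hedged sketch: the greedy matching order and applying the 25-pixel threshold only to ground-truth masks are our assumptions.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two binary masks (H, W)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def instance_scores(pred_masks, gt_masks, tau=0.5, min_pixels=25):
    """Match predicted to ground-truth masks at IoU threshold tau and derive AP/AR/F1."""
    gt_masks = [m for m in gt_masks if m.sum() >= min_pixels]  # ignore almost fully occluded objects
    matched_gt, tp = set(), 0
    for pm in pred_masks:
        ious = [mask_iou(pm, gm) if j not in matched_gt else 0.0
                for j, gm in enumerate(gt_masks)]
        if ious and max(ious) >= tau:
            tp += 1
            matched_gt.add(int(np.argmax(ious)))
    fp = len(pred_masks) - tp
    fn = len(gt_masks) - tp
    ap = tp / (tp + fp) if tp + fp else 0.0
    ar = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ap * ar / (ap + ar) if ap + ar else 0.0
    return ap, ar, f1, (fn == 0)   # last flag corresponds to the allObj criterion
```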
Next, we evaluate the quality of both the RGB and depth reconstruction of the generated objects. To assess the image reconstruction, we report Root Mean Squared Error (RMSE), Structural SIMilarity Index (SSIM) (Wang et al., 2004), and Peak Signal-to-Noise Ratio (PSNR) scores (Wang and Bovik, 2009). For the object geometry, we compute, similar to (Eigen et al., 2014), the Absolute Relative Difference (AbsRD), the Squared Relative Difference (SqRD), as well as the RMSE for the predicted depth. Furthermore, we report the error on the estimated objects' position (mean) and rotation (median; sym.: up to symmetries) for objects with a valid match w.r.t. τ = 0.5. We show results over five runs per configuration and report the mean.

Table 1. Results on the Clevr dataset (Johnson et al., 2017). The combination of our proposed loss with Gaussian blur is essential to guide the learning of scene decomposition and object-wise representations. We highlight the best (bold) and second best (underlined) results for each measure among the full model and the variations where we left out individual components for ablation. Specifying the maximum number of objects, we further train our model on scenes with 2, 4, or 5 objects. Despite the increased difficulty for larger numbers, our model recognizes most objects in scenes with two to five objects. Models trained with fewer objects can successfully explain scenes with a larger number of objects (#obj = o_train/o_test).

Clevr Dataset
In Fig. 4, we show reconstructed images, depth and normal maps on the Clevr (Johnson et al., 2017) scenes. Our model provides a complete reconstruction of the individual objects although they might be partially hidden in the image. The network can infer the color of the objects correctly and gets a basic idea about shading (e.g. that spheres are darker on the lower half). Shape characteristics such as extent, edges or curved surfaces are well recognized. As our model needs to fill all object slots, we sometimes observed that it fantasizes additional objects and hides them behind others. Some reconstruction artifacts at object boundaries are due to rendering hard transitions between objects and background.

Ablation Study. We evaluate various components of our model on the Clevr dataset with three objects. In Table 1, we compare training settings where we left out each of the loss functions and also demonstrate the benefit of applying Gaussian smoothing (denoted by G) as well as the effect of noise on depth maps.
At the beginning of training, the shape regularization loss is crucial to keep the shape encoder close to the pre-trained shape space and to prevent it from diverging due to the inaccurate pose estimates of the objects. Applying and decaying Gaussian blur distributes gradient information in the images beyond the object masks and allows the model to be trained in a coarse-to-fine manner. This helps the model to localize the various objects in the scene. The depth loss is essential for learning the scene decomposition. Without this loss, the network can simply describe several objects using a single object with more complex texture. The ground loss prevents the model from fitting objects into the ground plane. The image reconstruction loss plays only a minor part in the scene decomposition task but is mainly responsible for learning the texture of the objects. Visualizations of these findings can be seen in Fig. 5. Using all our proposed loss functions yields the best results over all metrics.
Fig. 5. Qualitative results for the ablation study. Typical failure cases can be observed when leaving out individual components of our model. The combination of all of our proposed loss functions is necessary to obtain a reasonable decomposition into the individual objects as well as meaningful object-wise representations which allow an appropriate scene reconstruction.

We observe only a slight decrease in performance when training on noisy depth maps. For this experiment, we added Gaussian noise with standard deviation σ = η · d² to the depth maps (η = 0.001, pixel-wise depth d). This indicates that our model is able to learn from non-perfect depth maps. Remarkably, our model is able to find objects at high recall rates (0.942 AR at 50% IoU).
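The noise model used in this experiment amounts to the following small NumPy sketch; the function and array names are our own.

```python
import numpy as np

def add_depth_noise(depth, eta=0.001, rng=None):
    """Per-pixel Gaussian noise with standard deviation sigma = eta * d^2."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = eta * depth ** 2                      # noise grows quadratically with depth
    return depth + rng.standard_normal(depth.shape) * sigma
```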
Manipulation. Our 3D scene model naturally facilitates generation and manipulation of scenes by altering the latent representation. In Fig. 1, we show example operations like switching the positions of two objects, changing their shape, or removing an entire object. The explicit knowledge about 3D shape also allows us to reason about object penetrations when generating new scenes. Specifically, we evaluate an object intersection loss L_int (Eq. 2) on the newly sampled scenes to filter out those that turn out to be unrealistic due to an intersection between objects; in its definition, i, j are object indices and x_k are K sample points distributed evenly between the object centers.

Table 2. Comparison to 2D Baselines. Genesis shows decent results on both the decomposition and reconstruction tasks but is overall weaker than our method. SA performs better on RGB reconstruction, but worse on most instance segmentation measures because many background pixels are assigned to object slots, while our model naturally differentiates objects and background. The used implementation of MONet failed to decompose the scene into the individual objects. In contrast to ours, none of the baseline methods predicts any explicit 3D information.
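Returning to the intersection test described above: since the exact definition of L_int is not reproduced here, the following sketch shows one plausible instantiation that evaluates both objects' SDFs at K sample points between the object centers; the function names and the rejection criterion are our assumptions.

```python
import torch

def objects_intersect(phi, z_sh_i, z_sh_j, to_obj_i, to_obj_j,
                      center_i, center_j, K=10, eps=0.0):
    """Heuristic intersection test between two objects.

    phi:       SDF network, phi(z_sh, points) -> signed distances
    to_obj_*:  callables mapping world points into each object's frame
    center_*:  (3,) object centers in world coordinates
    """
    t = torch.linspace(0.0, 1.0, K)[:, None]                 # K interpolation weights
    samples = (1 - t) * center_i[None] + t * center_j[None]  # (K, 3) points between the centers
    sdf_i = phi(z_sh_i, to_obj_i(samples))
    sdf_j = phi(z_sh_j, to_obj_j(samples))
    # A point lying inside both shapes (negative SDF in both) indicates a collision.
    return bool(((sdf_i < eps) & (sdf_j < eps)).any())
```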
Object Count. We demonstrate generalization to different maximum numbers of objects in Tab. 1. The model is trained with the respective number of objects in the dataset (o_train). Due to the setup of our dataset, it might happen that objects are occluded and thus not visible in the image. This forces the model to learn to hide surplus objects behind other objects. On average, our model finds and describes objects in less crowded scenes more easily, while it still performs with high accuracy for five objects.
Besides evaluating the trained networks on scenes with equal settings, we also examine their transferability to scenes with a different number of objects. Due to the sequential architecture of our model, it can even be extended to parse scenes with more objects than it has been trained for (o_test). As we use a shared encoder for all objects, we can simply reset the number of encoding rollouts to the number of objects in the test data. Note that we assume the maximum number of objects to be known. Although our model would be able to hide redundant objects behind already reconstructed ones without this explicit change, it cannot reconstruct additional objects. Our model yields reasonable results, but performs best for similar object numbers in training and testing. The achieved AR_0.5 and allObj measures indicate that the model is able to detect the objects at good rates. For instance, for #obj=3/5, our model finds 71% of all objects (AR_0.5) and can explain the full scene in about 21% of cases. Qualitative results can be seen in Fig. 6.
Comparison to 2D Baselines. We compare our method to the 2D multi-object scene representation approaches MONet, Genesis (Engelcke et al., 2020), and Slot Attention (SA) (Locatello et al., 2020). We used the provided code 1 (adapted for 64×64 images and #objs+bg slots) with the original hyperparameters for the original Clevr setup and trained it on our dataset. In the case of SA, we obtained masks by assigning each pixel to the slot with the highest attention value. For evaluation, we use both our metrics and the Adjusted Rand Index (ARI) (Rand, 1971; Hubert and Arabie, 1985), which measures clustering similarity and was used in (Locatello et al., 2020). We consider both the full ARI score and its variant limited to the ground-truth foreground pixels (ARI-FG).
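The ARI scores mentioned above can be computed from per-pixel instance labelings, e.g. with scikit-learn; the label layout in this sketch is an assumption.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def ari_scores(gt_labels, pred_labels, background_id=0):
    """ARI over all pixels and ARI-FG over ground-truth foreground pixels only.

    gt_labels, pred_labels: (H, W) integer instance maps (background_id marks background).
    """
    gt, pred = gt_labels.ravel(), pred_labels.ravel()
    ari = adjusted_rand_score(gt, pred)
    fg = gt != background_id
    ari_fg = adjusted_rand_score(gt[fg], pred[fg])
    return ari, ari_fg
```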
Our experiments with MONet did not yield sufficient results as the model would simply use a single object slot to describe the entire scene. SA's low instance segmentation scores result from a high number of background pixels in the object masks, which becomes especially clear when comparing the large difference in performance between ARI and ARI-FG. Genesis is able to decompose the scene into objects but its reconstructions are worse than SA's or ours. Due to the usage of shape priors, our model is naturally restricted to produce a reasonable foreground/background decomposition. In contrast to our method, none of the others estimates any 3D information (e.g. shape or pose). Furthermore, their object representation is not interpretable and does not allow intuitive manipulation of the scene.

Fig. 7. Comparison to 2D Baselines. The used implementation of MONet showed difficulties in decomposing the scenes; instead, the network would describe an entire scene using a single slot only. Genesis is able to decompose the scene into objects and separates them cleanly from the background. However, its reconstruction results are weaker than ours. While Slot Attention (SA) yields a good RGB reconstruction, it often mixes object masks with the background. Due to the explicit rendering of 3D shapes, our model naturally differentiates between individual objects and background.

Table 3. Evaluation on scenes with ShapeNet objects (Chang et al., 2015). Results for scenes containing objects from different categories. We differentiate between scenes that consist of shapes that were seen during training and novel objects. We show mean and best outcome over five runs.

ShapeNet Dataset
Our composed multi-object variant of ShapeNet (Chang et al., 2015) models exhibits larger shape and texture variation than Clevr (Johnson et al., 2017) and is thus more difficult. For some object categories such as cups or armchairs, training can converge to local minima. We report mean and best results over five training runs in Tab. 3, where the best run is chosen according to the F1 score on the validation set. Evaluation is performed on two different test sets: scenes containing (1) object instances with shapes and textures used for training and (2) unseen object instances. We show several scene reconstructions in Fig. 8.
For the cars, our model yields consistent performance in all runs with decomposition results comparable to our Clevr experiments. However, we found that cars exhibit a pseudo-180-degree shape symmetry which was difficult for our model to differentiate. Especially for small objects in the background, it favors adapting the texture over rotating the object. For the armchair shapes, our model finds local minima in pseudo-90-degree symmetries. The median rotation error indicates better-than-chance prediction of the correct orientation. Rotation error histograms can be found in the supplementary material. For approximately correct rotation predictions, we found that our model was able to differentiate between basic shape types but often neglected finer details like thin armrests which are difficult to differentiate in the images.
Our tabletop dataset provides another type of challenge: the network needs to distinguish different object categories with larger shape and scale variation. For this setting, we added further auxiliary losses that penalize object intersections (Eq. 2) as well as object positions outside of the image view.

Fig. 9. Novel view renderings. Our model is able to generate new scene renderings for largely rotated camera views from just a single input RGB image. While we noticed a reduced texture accuracy for unseen object parts, the normal maps demonstrate that our model obtains a good 3D structural understanding of the scene.
Our model is able to predict the different shape types with coarse textures. On scenes with instances that were not seen during training, our model often approximates the shapes with similar training instances. As can be expected, results are slightly worse compared to the evaluation on shapes known from training. Nevertheless, our model is still able to generate a reasonable scene decomposition using similar objects from the training set, which demonstrates the generalization capability of our network.

Novel Views. Due to the learned 3D structure, our model is able to render novel views of a scene given a single image (see Fig. 9). Although our model never saw multiple views of the same scene during training and is not tuned for this task, we obtain reasonable results for both scene geometry and appearance. We observe a lower texture reconstruction accuracy for invisible scene parts.

Supervised Training. We examine the benefits of using additional supervision for training. Specifically, we utilize ground truth annotations for either (1) 3D object poses or (2) 2D foreground/background segmentation masks (Tab. 4).
For the first variant, we consider the known 3D position, rotation around the z-axis, and scale. To account for object order invariance, we determine object matches (z_i,ext, z_gt_m(i),ext) where each predicted object is assigned to a ground-truth object such that every ground-truth object is matched exactly once and the summed Euclidean distance between matched predicted and ground-truth objects is minimal (see the matching sketch after Fig. 10 below). With z_i,ext = (p_i, θ_i, s_i), we add a pose supervision loss with terms on position, rotation, and scale, the scale term being l_scale(s_i, s_j) = (s_i − s_j)². We observe that supervision on ground-truth 3D object poses helps our model over all categories to reliably decompose the scene into the constituent objects and to achieve improved accuracy on the pose estimation. We also note that this type of supervision helps our model to overcome local minima due to pseudo-symmetry. The main drawback of using 3D poses for supervision is that this kind of annotation for real 2D images is very expensive. For the second variant, we consider the combined foreground masks M_1:N, M_gt for predicted and ground-truth objects, apply Gaussian smoothing as for the image and depth reconstruction losses, and use binary cross entropy for computing the loss. This loss significantly helped for the tabletop dataset and also yielded improvements for car objects regarding the RGB and depth reconstruction measures compared to the unsupervised setup. In contrast, the performance on the chair dataset decreased. In particular, we observed that our model often was only able to detect two of the three objects and missed smaller objects in the background, which leads to a low AR_0.5 score. This indicates that supervision on the foreground mask does not yield a sufficient training signal to always overcome local minima. However, this kind of supervision can still be interesting due to its lower annotation cost.

Fig. 10. Evaluation on real images. We show results on real images by our model that was trained on synthetic data. We notice that our model is able to capture the coarse scene layout and shape properties of the objects. However, challenges arise due to domain, lighting, camera intrinsics and view point changes, indicating interesting directions for future research.
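As referenced above, the matching for pose supervision can be obtained with a standard linear-assignment solver; this sketch uses SciPy and treats the summed Euclidean distance between positions as the cost, while names and shapes are our assumptions. The matched pairs can then be plugged into the position, rotation, and scale terms such as l_scale.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_objects(pred_pos, gt_pos):
    """Assign each predicted object to a distinct ground-truth object such that
    the summed Euclidean distance between matched positions is minimal.

    pred_pos: (N, 3) predicted object positions
    gt_pos:   (N, 3) ground-truth object positions
    Returns an array m where ground-truth index m[i] is matched to prediction i.
    """
    cost = np.linalg.norm(pred_pos[:, None] - gt_pos[None], axis=-1)  # (N, N) pairwise distances
    pred_idx, gt_idx = linear_sum_assignment(cost)
    m = np.empty(len(pred_pos), dtype=int)
    m[pred_idx] = gt_idx
    return m
```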

Fig. 11. Parsing real images of block towers (Lerer et al., 2016). We trained our model on synthetic images of stacked cubes and test on real images. Our model recognizes the scene configuration well, but occasionally objects are missed, especially if they are close to the image boundary.

Real Data
We further evaluated our model on real images of toy cars and wooden building blocks (see Fig. 10) as well as on the real block tower dataset from (Lerer et al., 2016) (see Fig. 11). For the former dataset, we adjusted brightness and contrast of the photos to visually match the background color of the synthetic data. For the block tower dataset, images were cropped and scaled. Despite different camera and image properties, our model decomposes the scenes into objects and obtains their coarse shape and appearance without any domain adaptation or fine-tuning on real data. Typical observed failure cases include wrong color prediction, difficulties with elongated shapes, and sometimes unrealistic object clusters.

Limitations
We show typical failure cases of our approach in Fig. 12. Self-supervised learning without regularizing assumptions typically leads to ill-conditioned problems. We use a pre-trained 3D shape space to confine the possible shapes, impose a multi-object decomposition of the scene, and use a differentiable renderer of the latent representation. In our self-supervised approach, ambiguities can arise due to the decoupling of shape and texture. For instance, the network can choose to occlude the background partially with the shape but fix the image reconstruction by predicting background color in these areas. Rotations can only be learned up to a pseudo-symmetry by self-supervision when object shapes are rotationally similar and the subtle differences in shape or texture are difficult to differentiate in the image. In such cases, the network can favor adapting the texture over rotating the shape. Depending on the complexity of the scenes and the complex combination of loss terms, training can run into local minima in which objects are moved outside the image or fitted into the ground plane. Currently, the network is trained for a maximum number of objects. If all objects in the scene are explained, it hides further objects; this could be alleviated by learning a stop criterion.

Conclusion
We propose a novel deep learning approach for self-supervised multi-object scene representation learning and parsing. Our approach infers the 3D structure of a scene from a single RGB image by recursively parsing the image for shape, texture and poses of the objects. A differentiable renderer allows images to be generated from the latent scene representation and the network to be trained in a self-supervised way from RGB-D images. We employ pre-trained shape spaces that are represented by deep neural networks using a continuous function representation as an appropriate prior for this ill-posed problem.
Our experiments demonstrate the ability of our model to parse scenes with various object counts and shapes. We provide an ablation study to motivate design choices and discuss assumptions and limitations of our approach. We show the advantages of our model in reasoning about the underlying 3D space of an observed scene by performing explicit manipulations on the individual objects or rendering novel views. While using synthetic data allows us to evaluate the design choices of our model in a controlled setup, we also show successful reconstructions of real images. We believe our approach provides an important step towards self-supervised learning of object-level 3D scene parsing and generative modeling of complex scenes from real images. Our work is currently limited to scenes with few objects as well as simple backgrounds and lighting conditions. Future work will address the challenges of more complex scenes.

Supplementary Material Appendix A. Overview
In this supplementary material, we report additional details about our model, training and evaluation procedures, as well as additional experimental results and insights.
First, we provide more information about the network architecture, additional auxiliary loss functions and the applied parameter settings in Appendix B. A more detailed listing of the reported metrics can be found in Appendix C.
Following this, we present further evaluation results on various datasets. In particular, we present more qualitative results on the Clevr dataset (Johnson et al., 2017) (Fig. D.13) as well as ShapeNet scenes (Chang et al., 2015) in Fig. D.18, Fig. D.19, and Fig. D.20. We show a more detailed analysis of our experiments with different object counts in Tab. D.6-D.8, including more qualitative results for these experiments (Fig. D.17).

Appendix B. Network Architecture & Parameters
We provide detailed information about our network architecture and parameter settings in Tab. B.5.

Appendix C. Evaluation Metrics
Instance Reconstruction. We evaluate the decomposition capability of our model by comparing the predicted object masks M_1:N with the ground-truth masks M_gt. For each object combination (M_i, M_gt,j), the IoU w.r.t. the occupied pixels is determined. We call an object o_i a true positive if there is an object o_gt,j for which IoU(M_i, M_gt,j) ≥ τ for some threshold τ. All other predicted objects are considered false positives. Ground-truth objects that were not associated with a prediction are counted as false negatives. As objects might not be visible in the image due to occlusion, we only consider masks with a minimum number of 25 occupied pixels. From these counts, we compute average precision, average recall, and the F1 score in the standard way.
Image Reconstruction. PSNR is computed as PSNR = 10 log10(L²/MSE), where L is the dynamic range of allowable image pixel intensities (Wang and Bovik, 2009). We refer the reader to (Wang et al., 2004) for a detailed explanation of the SSIM metric. We use the scikit-image implementation 2 to compute PSNR and SSIM scores.
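A minimal sketch of the image scores with scikit-image is given below; the data range and channel handling are assumptions about the data format.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_scores(pred, gt):
    """RMSE, PSNR and SSIM between two RGB images in [0, 1] with shape (H, W, 3)."""
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    # For scikit-image < 0.19, replace channel_axis=-1 with multichannel=True.
    ssim = structural_similarity(gt, pred, data_range=1.0, channel_axis=-1)
    return rmse, psnr, ssim
```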

Depth Reconstruction. Our depth reconstruction evaluation is based on (Eigen et al., 2014) and uses the following measures over the set of image pixels Ω:
AbsRD = (1/|Ω|) Σ_{u∈Ω} |D(u) − D_gt(u)| / D_gt(u),
SqRD = (1/|Ω|) Σ_{u∈Ω} (D(u) − D_gt(u))² / D_gt(u),
RMSE(D, D_gt) = sqrt( (1/|Ω|) Σ_{u∈Ω} (D(u) − D_gt(u))² ).
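Assuming the metrics follow the standard definitions reconstructed above, a small NumPy sketch could look like this; the optional validity mask is our assumption.

```python
import numpy as np

def depth_errors(pred, gt, valid=None):
    """AbsRD, SqRD and RMSE between predicted and ground-truth depth maps (H, W)."""
    if valid is None:
        valid = np.ones_like(gt, dtype=bool)      # optionally restrict to evaluated pixels
    d, g = pred[valid], gt[valid]
    abs_rd = np.mean(np.abs(d - g) / g)
    sq_rd = np.mean((d - g) ** 2 / g)
    rmse = np.sqrt(np.mean((d - g) ** 2))
    return abs_rd, sq_rd, rmse
```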
Pose Estimation. We evaluate the error on the predicted pose only for objects that were denoted as true positives, i.e. for which we found a valid ground-truth object match. Since we are missing the association between object masks and object poses in our data, we compare each predicted object's position p_i to the closest ground-truth object position p_gt,j. Each ground-truth object is assigned at most once in a greedy procedure.

Table B.5. Network Parameters. We report the parameter settings that were used for our experiments. Notation: *: same as Clevr; marked layers receive the latent vector concatenated to their input (see (Park et al., 2019)).

Appendix D. Additional Results
We provide additional qualitative and quantitative experimental results for both the Clevr and ShapeNet datasets. Please refer to the figures and tables for more information about the respective experimental setup.

Qualitative results on Clevr (Johnson et al., 2017) with three, four, and five objects (Fig. D.13). Our model is able to decompose the scene into the individual objects. It recognizes basic color appearance and geometric properties like shape type and deformations (best seen in the normal maps). It is able to infer complete objects although some of them might be partly occluded by others in the input image. In the last two rows we also show failure cases: we found that our model sometimes misinterprets cubes as cylinders, which is presumably due to the similarity of their shape and appearance at the image resolution. In a few cases, it only detects a low number of objects, predominantly the most significant ones.

Latent manipulation on Clevr (Johnson et al., 2017). We linearly adapt the first object's shape (top) or texture (middle) latent to match each of the other objects' respective representation. Moreover, we manipulate the pose of a single object to either rotate it or move it within the scene parallel to either the x- or y-axis (bottom). Note that the camera is not parallel to any of these axes. As we reason about objects in 3D, we are able to recognize intersections between objects and exclude invalid scenes (missing images in the last row). By doing so, we are able to generate new plausible scenes. For all scenes, we show the original reconstruction in the middle and perform increasing manipulation in the latent space in both left and right directions. Object shapes are best seen in the normal maps.

Real block tower images (Lerer et al., 2016). We further adapted the camera pose to better align with the real images. Other properties are the same as for our standard Clevr dataset. During training, we applied further data randomization by adding uniform color noise sampled from a Gaussian distribution (σ = 0.025). Bottom: for evaluation, we considered the last frame of each sequence containing up to three objects of the dataset from (Lerer et al., 2016). These images were cropped and down-scaled to obtain images of the required size. Our model infers a reasonable scene decomposition for the most part. The configuration of the objects including their color and pose is well described. However, we observe that it sometimes misses single objects or uses an inaccurate coloring. Albeit our simple background model is not able to reconstruct the shades of the cloth in the background, it is robust against this unfamiliar structure. Overall, we observe a better transfer to the real data domain, which we attribute to adapted dataset settings (e.g. camera pose) as well as the applied data augmentation during training.

Object count generalization (Tab. D.6-D.8). If tested on scenes with a larger number of objects, our model is able to detect more objects than it has seen during training, as can be seen from the AR_0.5 and allObj scores. For example, consider the case #obj=3/5: our model was trained on three objects only but detects 71% of all objects in five-object scenes and explains 21% of them completely. If it did not extrapolate to more objects than it has seen during training, it would in the best case be able to detect 63% of all objects and would hardly ever be able to explain the entire scene (< 5%). These numbers account for existing object occlusion in our dataset.

Scenes composed of ShapeNet (Chang et al., 2015) car models. Our model generates reasonable reconstructions for scenes with both seen (left) and unseen (right) object instances. For the latter case, it describes objects with similar shapes and textures it has seen in training. Typical failure cases are related to a pseudo-180-degree symmetry of the cars that is not distinguished by the model but handled by adapting the texture. In the lower two rows, all cars face in the wrong direction. This is in most cases not obvious from the reconstructed images alone.

Scenes composed of ShapeNet (Chang et al., 2015) chair models. For the chair models it is more important than for the other models in our datasets to predict the correct rotation in order to infer a well-matching shape. The model can still get trapped in local minima of 90-degree rotation steps where it would rather adapt shape and texture reconstruction instead of the estimated rotation. Due to the low resolution as well as the discrete sampling by the renderer, our model is prone to miss fine structural elements like armrests or thin legs.

Scenes composed of mixed ShapeNet (Chang et al., 2015) tabletop models. For our mixed dataset, our model needs to predict object shapes from three different categories (mugs, bottles, cans) as well as their typical size ranges. We found that our model is able to distinguish between the objects based on their typical characteristics. Unseen objects in the second test set are typically replaced by objects known from training which are similar in appearance. Handles of cups as well as thin, long bottlenecks are often neglected by the model. Especially for small objects, the model sometimes fails to reconstruct an object in the scene. The last row shows reconstructions from a failed training run in which only one object is found.

Latent manipulation on ShapeNet (Chang et al., 2015) scenes with chair models. We linearly adapt the first object's latent to match each of the other objects' respective representation in either shape alone (top rows) or shape and texture (bottom rows). By this, we are able to generate new plausible scenes. Object shapes are best seen in the normal maps.

Novel view renderings. As our model reasons about the underlying 3D structure of a given image, it is able to render novel views of a scene. This is possible although our model was trained exclusively from single images. The reconstructed normal maps show that the model learned to reason about the depicted objects in 3D space. It can be observed that our model renders the reverse side of the car objects less accurately than the visible parts. This might be due to the limited range of rotations that the model infers due to pseudo-symmetry.

Fig. D.23. Rotation prediction on ShapeNet (Chang et al., 2015) for the cars, chairs, and tabletop scenes. From top to bottom: ground truth and predicted rotation angles for each dataset and resulting rotation angles, considering a single trained model. While values for the ground truth rotation are naturally uniformly distributed over the entire range of [−π, π] for all scenes, we found that predicted rotation estimates can be spread over a smaller sub-range. Peaks in the histogram for cars (∼ π) and chairs (∼ π/2, ∼ π) indicate that the model got stuck in local minima where it predicts a rotation up to a pseudo-symmetry. In contrast, the rotation error for the tabletop scenes is almost uniform due to the rotational symmetry of the shapes and the capability of adapting the texture.

Supervised training on ShapeNet (Chang et al., 2015). We either use cross entropy on combined masks or determine the best pose assignment between prediction and ground truth to account for object order invariance. Using 3D object poses as additional supervision provides a stable training setup through which our model learns to detect nearly all objects for all datasets (see AR_0.5 and allObj scores). Notably, our model is able to overcome the pseudo-symmetry local minima (see Err_rot). While other improvements on the cars dataset are rather small, we observe significantly better results for both the chairs and tabletop datasets. However, annotating 3D poses in images is relatively expensive. In comparison, 2D foreground masks helped for both the cars and tabletop datasets but led to worse results for the chair dataset. In particular, none of the trained models for the chair scenes was able to reliably detect all three objects. This indicates that supervision on the foreground mask does not yield a sufficient training signal to always overcome local minima. However, this kind of supervision can still be interesting due to the lower cost for annotation.