1 Introduction

Humans use a hammer by holding its handle and striking its head, not vice versa. In this simple action, people demonstrate their understanding of functional parts [33, 39]: a tool, or any object, can be decomposed into primitive-based components, each with distinct physics, functionality, and affordances [17].

How to build a machine of such competency? In this paper, we tackle the problem of physical primitive decomposition (PPD)—explaining the shape and the physics of an object with a few shape primitives with physical parameters. Given the hammer in Fig. 1, our goal is to build a model that recovers its two major components: a tall, wooden cylinder for its handle, and a smaller, metal cylinder for its head.

Fig. 1. A hammer (left) and its physical primitive decomposition (right).

For this task, we need a physical, part-based object shape representation that models both object geometry and physics. Ground-truth annotations for such representations are, however, challenging to obtain: large-scale shape repositories like ShapeNet [8] often have limited annotations on object parts, let alone physics. This is mostly due to two reasons. First, annotating object parts and physics is labor-intensive and requires strong domain expertise, neither of which can be offered by current crowdsourcing platforms. Second, there is intrinsic ambiguity in the ground truth: it is impossible to precisely label underlying physical object properties like density from images or videos alone.

Let’s think more about what these representations are for. We want our object representation to faithfully encode its geometry; therefore, it should be able to explain our visual observation of the object’s appearance. Further, as the representation models object physics, it should be effective in explaining the object’s behaviors in various physical events.

Inspired by this, we propose a novel formulation that learns a part-based object representation from both visual observations and physical interactions. Starting with a single image and a voxelized shape, the model recovers the geometric primitives and infers their physical properties from texture. The physical representation inferred this way is of course rather uncertain; it therefore serves only as the model's prior over the physical shape. Observing object behaviors in physical events offers crucial additional information, as objects with different physical properties behave differently in such events. The model uses these observations in conjunction with the prior to produce its final prediction.

We evaluate our system for physical primitive decomposition in three scenarios. First, we generate a dataset of synthetic block towers, where each block has distinct geometry and physics. Our model is able to successfully reconstruct the physical primitives by making use of both appearance and motion cues. Second, we evaluate the system on a set of synthetic tools, demonstrating its applicability to daily-life shapes. Third, we build a new dataset of real block towers in dynamic scenes, and evaluate the model’s generalization power to real videos.

We further present ablation studies to understand how each source of information contributes to the final performance. We also conduct human behavioral experiments to contrast the performance of the model with humans. In a ‘which block is heavier’ experiment, our model performs comparably to humans.

Our contributions in this paper are three-fold. First, we propose the problem of physical primitive decomposition—learning a compact, disentangled object representation in terms of physical primitives. Second, we present a novel learning paradigm that learns to characterize shapes in physical primitives to explain both their geometry and physics. Third, we demonstrate that our system can achieve good performance on both synthetic and real data.

2 Related Work

Primitive-Based 3D Representations. Early attempts at modeling 3D shapes with primitives include decomposing them into blocks [34], generalized cylinders [6], and geons [5]. This idea has been revisited throughout the development of computer vision [2, 11, 13]. To name a few, Gupta et al. [11] modeled scenes as qualitative blocks, and van den Hengel et al. [13] as Lego blocks. More recently, Tulsiani et al. [40] combined the new and the old, using a deep convolutional network to generate primitives for a given 3D shape; later, Zou et al. proposed 3D-PRNN [53], enhancing the flexibility of the system by leveraging modern advances in recurrent generative models [41].

Fig. 2. Primitive decomposition (a) and physical primitive decomposition (b). Both tasks attempt to convert an object into a set of primitives, yet with different purposes: the former targets shape reconstruction, while the latter aims to recover both geometric and physical properties.

Primitive-based representations have profound impact that goes far beyond the field of computer vision. Scientists have employed this representation for user-interactive design [16] and for teaching robots to grasp objects [29]. In the field of computer graphics, the idea of modeling shapes as primitives or parts has also been extensively explored [2, 19, 21, 27, 47, 50]. Researchers have used the part-based representation for single-image shape reconstruction [15], shape completion [37], and probabilistic shape synthesis [14, 25].

Physical Shape and Scene Modeling. Beyond object geometry, there has been growing interest in modeling physical object properties and scene dynamics. The computer vision community has put major effort into building rich and sizable databases. ShapeNet-Sem [36] is a collection of object shapes with material and physics annotations within the web-scale shape repository ShapeNet [8]. The Materials in Context Database (MINC) [4] is a gigantic dataset of materials in the wild, associating patches in real-world images with 23 materials.

Research on physical object modeling dates back to the study of “functional parts” [17, 33, 39]. The field of learning object physics and scene dynamics has prospered in the past few years [1, 3, 7, 18, 20, 23, 26, 30, 32, 38, 48]. Among them, a few papers explicitly build physical object representations [30, 43, 44, 45, 49]. Though these approaches also study object physics [43, 45], functionality [46, 51], and affordances [10, 22, 52], they usually assume a homogeneous object with simple geometry. In this paper, we model an object with physical primitives for richer expressiveness and higher precision.

3 Physical Primitive Decomposition

3.1 Problem Statement

Both primitive decomposition and physical primitive decomposition attempt to approximate an object with primitives. We highlight their difference in Fig. 2.

Fig. 3. Challenges of inferring physical parameters from visual and physical observations: objects with different physical parameters may have (a) similar visual appearance or (b) similar physics trajectories.

Primitive Decomposition. As formulated by Tulsiani et al. [40] and Zou et al. [53], primitive decomposition aims to decompose an object O into a set of simple transformed primitives \(x = \{x_k\}\) such that these primitives accurately approximate its geometric shape. This task can be seen as minimizing

$$\begin{aligned} \mathcal {L}_\text {G}(x) = \mathcal {D}_\text {S} \big ( \mathcal {S} \big ( \underset{k}{\cup } x_k \big ), \mathcal {S}(O) \big ), \end{aligned}$$
(1)

where \(\mathcal {S}(\cdot )\) denotes the geometric shape (e.g., point cloud), and \(\mathcal {D}_\text {S}(\cdot , \cdot )\) denotes the distance metric between shapes (e.g., earth mover's distance [35]).

Physical Primitive Decomposition. In order to understand the functionality of object parts, we require the decomposed primitives \(x = \{x_k\}\) to also approximate the physical behavior of object O. To this end, we extend the previous objective function with an additional physics term:

$$\begin{aligned} \mathcal {L}_\text {P}(x) = \sum _{p \in \mathcal {P}} \mathcal {D}_\text {T} \big ( \mathcal {T}_p \big ( \underset{k}{\cup } x_k \big ), \mathcal {T}_p(O) \big ), \end{aligned}$$
(2)

where \(\mathcal {T}_p(\cdot )\) denotes the trajectory after physics interaction p, \(\mathcal {D}_\text {T}(\cdot , \cdot )\) denotes the distance metric between trajectories (e.g., mean squared error), and \(\mathcal {P}\) denotes a predefined set of physics interactions. The task of physical primitive decomposition is then to minimize an overall objective function constraining both geometry and physics: \(\mathcal {L}(x) = \mathcal {L}_\text {G}(x) + w \cdot \mathcal {L}_\text {P}(x)\), where w is a weighting factor.

3.2 Primitive-Based Representation

We design a structured primitive-based object representation that describes an object by listing all of its primitives with their attributes. For each primitive \(x_k\), we record its size \(x^\text {S}_k = (s_x, s_y, s_z)\), its position in 3D space \(x^\text {T}_k = (p_x, p_y, p_z)\), and its rotation in quaternion form \(x^\text {R}_k = (q_w, q_x, q_y, q_z)\). Apart from this geometric information, we also track its physical property: density \(x^\text {D}_k\).

In our object representation, the shape parameters \(x^\text {S}_k\), \(x^\text {T}_k\) and \(x^\text {R}_k\) are vectors of continuous real values, whereas the density parameter \(x^\text {D}_k\) is discrete. We discretize the density values into \(N_\text {D} = 100\) slots, so that estimating density becomes an \(N_\text {D}\)-way classification. Discretization helps deal with multi-modal density values. Figure 3a shows that two parts with similar visual appearance may have very different physical parameters. In such cases, regression with an \(\mathcal {L}_2\) loss encourages the model to predict the mean of the possible densities; in contrast, discretization allows it to assign high probability to every plausible density. We later determine which candidate value is optimal from the trajectories.
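
To make the discretization concrete, here is a minimal sketch in Python; the value range and uniform bin edges are illustrative assumptions, not the paper's exact quantization.

```python
# A minimal sketch of density discretization as N_D-way classification.
# The value range and uniform bin edges are assumptions for illustration.
import numpy as np

N_D = 100                                        # number of density slots
DENSITY_BINS = np.linspace(0.0, 100.0, N_D + 1)  # assumed value range

def density_to_class(density: float) -> int:
    """Map a continuous density to one of N_D discrete slots."""
    idx = np.searchsorted(DENSITY_BINS, density, side="right") - 1
    return int(np.clip(idx, 0, N_D - 1))

def class_to_density(idx: int) -> float:
    """Map a slot index back to its representative (bin-center) density."""
    return float(0.5 * (DENSITY_BINS[idx] + DENSITY_BINS[idx + 1]))
```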

Fig. 4. Overview of our PPD model.

4 Approach

In this section, we discuss our approach to the problem of physical primitive decomposition (PPD). We present an overview of our framework in Fig. 4.

4.1 Overview

Inferring physical parameters from visual or physical observation alone is highly challenging, because two objects with different physical parameters may have similar visual appearance (Fig. 3a) or similar physics trajectories (Fig. 3b). Therefore, our model takes both types of observations as input:

  1. Visual Observation. We take a voxelized shape and an image as input because they provide valuable visual information: voxels help recover object geometry, while images contain texture information about object materials. Note that, even with voxels as input, it is still highly nontrivial to infer geometric parameters: the model needs to learn to segment 3D parts within the object, an unsolved problem by itself [40].

  2. Physics Observation. To explain the physical behavior of an object, we also need to observe its response to physics interactions. In this work, we use 3D object trajectories rather than RGB (or RGB-D) videos. Their abstractness enables the model to transfer better from synthetic to real data: synthetic and real videos can be starkly different, whereas it is easy to generate synthetic 3D trajectories that look realistic.

Specifically, our network takes a voxel grid V, an image I, and \(N_\text {T}\) object trajectories \(\varvec{T} = \{T_k\}\) as input. V is a 3D binary voxelized grid, I is a single RGB image, and \(\varvec{T}\) consists of several object trajectories \(T_k\), each recording the response to one specific physics interaction. Trajectory \(T_k\) is a sequence of 3D object poses \((p_x, p_y, p_z, q_w, q_x, q_y, q_z)\), where \((p_x, p_y, p_z)\) denotes the object's center position and quaternion \((q_w, q_x, q_y, q_z)\) denotes its rotation at each time step.

After receiving the inputs, our network encodes the voxels, image and trajectories with separate encoders, and sequentially predicts primitives using a recurrent primitive generator. For each primitive, the network predicts its geometric shape (i.e., scale, translation, and rotation) and physical property (i.e., density). More details of our model can be found in the supplementary material.

Voxel Encoder. For input voxel V, we employ a 3D volumetric convolutional network to encode the 3D shape information into a voxel feature \(f_\text {V}\).

Image Encoder. For input image I, we pass it into the ResNet-18 [12] encoder to obtain an image feature \(f_\text {I}\). We refer the readers to He et al. [12] for details.

Trajectory Encoder. For input trajectories \(\varvec{T}\), we encode each trajectory \(T_k\) into a low-dimensional feature vector \(h_k\) with a separate bi-directional recurrent neural network. Specifically, we feed the trajectory sequence, \(T_k\), and the same sequence in reverse order, \(T_k^\text {reverse}\), into two encoding RNNs, obtaining two final hidden states: \(h^\rightarrow _k = \text {encode}^\rightarrow _k(T_k)\) and \(h^\leftarrow _k = \text {encode}^\leftarrow _k(T_k^\text {reverse})\). We take \([h_k^\rightarrow ; h_k^\leftarrow ]\) as the feature vector \(h_k\). Finally, we concatenate the features of all trajectories, \(\{h_k \mid k = 1, 2, \ldots , N_\text {T}\}\), and project them into a low-dimensional trajectory feature \(f_\text {T}\) with a fully-connected layer.
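
A minimal PyTorch sketch of the three encoders is below. The layer sizes are assumptions (the exact architectures are in the supplementary material), and a single bi-directional GRU stands in for each pair of forward/reverse encoding RNNs described above.

```python
# A sketch of the voxel, image, and trajectory encoders; sizes are assumed.
import torch
import torch.nn as nn
import torchvision.models as models

class Encoders(nn.Module):
    def __init__(self, feat_dim=256, n_traj=4):
        super().__init__()
        # Voxel encoder: 3D convolutions over a 32x32x32 binary grid.
        self.voxel_enc = nn.Sequential(
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 4
            nn.Flatten(),
            nn.Linear(64 * 4 * 4 * 4, feat_dim),
        )
        # Image encoder: ResNet-18 backbone with a new projection head.
        resnet = models.resnet18()
        resnet.fc = nn.Linear(resnet.fc.in_features, feat_dim)
        self.image_enc = resnet
        # One bi-directional GRU per trajectory; 7-D pose at each time step.
        self.traj_enc = nn.ModuleList(
            [nn.GRU(7, 64, batch_first=True, bidirectional=True)
             for _ in range(n_traj)])
        self.traj_proj = nn.Linear(n_traj * 2 * 64, feat_dim)

    def forward(self, voxel, image, trajs):
        # voxel: (B, 1, 32, 32, 32); image: (B, 3, H, W)
        # trajs: list of n_traj tensors, each (B, T, 7)
        f_v = self.voxel_enc(voxel)
        f_i = self.image_enc(image)
        hs = []
        for rnn, traj in zip(self.traj_enc, trajs):
            _, h = rnn(traj)                          # h: (2, B, 64), fwd & bwd
            hs.append(torch.cat([h[0], h[1]], dim=-1))
        f_t = self.traj_proj(torch.cat(hs, dim=-1))
        return f_v, f_i, f_t
```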

Primitive Generator. We concatenate the voxel feature \(f_\text {V}\), image feature \(f_\text {I}\) and trajectory feature \(f_\text {T}\) together as \(\hat{f} = [f_\text {V}; f_\text {I}; f_\text {T}]\), and map it to a low-dimensional feature f using a fully-connected layer. We predict the set of physical primitives \(\{x_k\}\) sequentially by a recurrent generator.

At each time step k, we feed the previously generated primitive \(x_{k-1}\) and the feature vector f as input, and receive a hidden vector \(h_k\) as output. Then, we compute the new primitive \(x_k = (x^\text {D}_k, x^\text {S}_k, x^\text {T}_k, x^\text {R}_k)\) as

$$\begin{aligned} \begin{aligned} x^\text {D}_k&= \text {softmax}(W_\text {D} \times h_k + b_\text {D}), \quad x^\text {S}_k = \text {sigmoid}(W_\text {S} \times h_k + b_\text {S}) \times C_\text {S}, \\ x^\text {T}_k&= \tanh (W_\text {T} \times h_k + b_\text {T}) \times C_\text {T}, \quad x^\text {R}_k = \frac{W_\text {R} \times h_k + b_\text {R}}{\max (||W_\text {R} \times h_k + b_\text {R}||_2, \epsilon )}, \end{aligned} \end{aligned}$$
(3)

where \(C_\text {S}\) and \(C_\text {T}\) are scaling factors, and \(\epsilon = 10^{-12}\) is a small constant for numerical stability. Equation (3) guarantees that \(x^\text {S}_k\) lies in \([0, C_\text {S}]\), \(x^\text {T}_k\) lies in \([-C_\text {T}, C_\text {T}]\), and \(||x^\text {R}_k||_2\) is 1 (ignoring \(\epsilon \)), which ensures that \(x_k\) is always a valid primitive. In our experiments, we set \(C_\text {S} = C_\text {T} = 0.5\), since we normalize all objects to fit in unit cubes. Also note that \(x^\text {D}_k\) is an \((N_\text {D} + 2)\)-dimensional vector, where the first \(N_\text {D}\) dimensions correspond to density values and the last two to the “start” and “end” tokens.
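
The output heads of Eq. (3) can be sketched as follows; the hidden dimension and layer names are assumptions.

```python
# A sketch of the per-primitive output heads in Eq. (3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrimitiveHeads(nn.Module):
    def __init__(self, hidden_dim=256, n_density=100, c_s=0.5, c_t=0.5):
        super().__init__()
        self.c_s, self.c_t = c_s, c_t
        self.fc_d = nn.Linear(hidden_dim, n_density + 2)  # + start/end tokens
        self.fc_s = nn.Linear(hidden_dim, 3)
        self.fc_t = nn.Linear(hidden_dim, 3)
        self.fc_r = nn.Linear(hidden_dim, 4)

    def forward(self, h):
        x_d = F.softmax(self.fc_d(h), dim=-1)         # density distribution
        x_s = torch.sigmoid(self.fc_s(h)) * self.c_s  # size in [0, C_S]
        x_t = torch.tanh(self.fc_t(h)) * self.c_t     # position in [-C_T, C_T]
        r = self.fc_r(h)
        # Normalize to a unit quaternion; clamp matches max(||.||_2, eps).
        x_r = r / r.norm(dim=-1, keepdim=True).clamp(min=1e-12)
        return x_d, x_s, x_t, x_r
```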

Sampling and Simulating with the Physics Engine. At test time, we treat the predicted \(x^\text {D}_k\) as a multinomial distribution and sample multiple candidate predictions from it. For each sample, we use its physical parameters to simulate a trajectory with a physics engine. Finally, we select the sample whose simulated trajectory is closest to the observed one.
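
A sketch of this test-time procedure, where `simulate` is a hypothetical wrapper around the physics engine:

```python
# Draw density candidates from the predicted distributions, simulate each,
# and keep the candidate whose trajectory is closest to the observation.
import numpy as np

def select_by_simulation(density_probs, shape_params, observed_traj,
                         simulate, n_samples=10, rng=None):
    rng = rng or np.random.default_rng()
    best_densities, best_err = None, np.inf
    for _ in range(n_samples):
        # One density slot per primitive, sampled from its multinomial.
        densities = [rng.choice(len(probs), p=probs) for probs in density_probs]
        traj = simulate(shape_params, densities)     # (T, 7) pose sequence
        err = np.abs(traj - observed_traj).mean()    # trajectory MAE
        if err < best_err:
            best_densities, best_err = densities, err
    return best_densities
```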

An alternative way to incorporate the physics engine is to optimize our model directly through it. As most physics engines are not differentiable, we employ REINFORCE [42] for optimization. Empirically, this reinforcement learning based method performs worse than the sampling-based method, possibly due to the large variance of the approximate gradient signals.

Simulating with a physics engine requires knowing the applied force at test time. Such an assumption is essential to keep the problem well-posed: without knowing the force, we can only infer relative part densities, not their actual values. Note that in many real-world applications, such as robot manipulation, the external force is indeed available.

4.2 Loss Functions

Let \(x = (x_1, x_2, \ldots , x_n)\) and \(\hat{x} = (\hat{x}_1, \hat{x}_2, \ldots , \hat{x}_m)\) be the predicted and ground-truth physical primitives, respectively. Our loss function consists of two terms: a geometry loss \(\mathcal {L}_\text {G}\) and a physics loss \(\mathcal {L}_\text {P}\):

$$\begin{aligned} \mathcal {L}_\text {G}(x, \hat{x})&= \sum _k \left( \omega _\text {S} \cdot ||x^\text {S}_k - \hat{x}^\text {S}_k||_1 + \omega _\text {T} \cdot ||x^\text {T}_k - \hat{x}^\text {T}_k||_1 + \omega _\text {R} \cdot ||x^\text {R}_k - \hat{x}^\text {R}_k||_1 \right) , \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {L}_\text {P}(x, \hat{x})&= - \sum _k \sum _i \hat{x}^\text {D}_k (i) \cdot \log x^\text {D}_k(i), \end{aligned}$$
(5)

where \(\omega _\text {S}\), \(\omega _\text {T}\) and \(\omega _\text {R}\) are weighting factors, all set to 1 because \(x^\text {S}\), \(x^\text {T}\) and \(x^\text {R}\) are of the same magnitude (\(10^{-1}\)) in our datasets. Combining Equations (4) and (5), we define the overall loss function as \(\mathcal {L}(x, \hat{x}) = \mathcal {L}_\text {G}(x, \hat{x}) + w \cdot \mathcal {L}_\text {P}(x, \hat{x})\), where w is set so that \(\mathcal {L}_\text {G}\) and \(\mathcal {L}_\text {P}\) are of the same magnitude.
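
A sketch of the combined loss, assuming predictions and ground truth are already associated in the same order (see the paragraph on part associations below):

```python
# A sketch of the loss in Eqs. (4)-(5).
import torch
import torch.nn.functional as F

def ppd_loss(pred, gt, w_s=1.0, w_t=1.0, w_r=1.0, w=1.0):
    # pred/gt: dicts of stacked per-primitive tensors.
    #   's', 't', 'r': (K, 3)/(K, 3)/(K, 4) continuous shape parameters
    #   'd': (K, N_D + 2) density distribution / one-hot target
    loss_g = (w_s * F.l1_loss(pred['s'], gt['s'], reduction='sum')
              + w_t * F.l1_loss(pred['t'], gt['t'], reduction='sum')
              + w_r * F.l1_loss(pred['r'], gt['r'], reduction='sum'))
    # Cross entropy between target and predicted density distributions.
    loss_p = -(gt['d'] * torch.log(pred['d'].clamp(min=1e-12))).sum()
    return loss_g + w * loss_p
```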

Part Associations. In our formulation, object parts (physical primitives) follow a pre-defined order (e.g., from bottom to top), and our model is encouraged to learn to predict the primitives in the same order.

5 Experiments

We evaluate our PPD model in three diverse settings: synthetic block towers, where blocks have various materials and shapes; synthetic tools with more complex geometry; and real videos of block towers, to demonstrate transferability to real-world scenarios.

5.1 Decomposing Block Towers

We start with decomposing block towers (stacks of blocks).

Block Towers. We build block towers by stacking a variable number of blocks (2–5 in our experiments). We first sample the size of each block and then compute the center positions of the blocks from bottom to top. For the \(k^\mathrm{th}\) block, we denote its size as \((w_k, h_k, d_k)\); its center \((x_k, y_k, z_k)\) is obtained by sampling \(x_k \sim \mathcal {N}(x_{k-1}, w_{k-1}/4)\) and \(y_k \sim \mathcal {N}(y_{k-1}, h_{k-1}/4)\), and setting \(z_k = z_{k-1} + (d_{k-1} + d_k) / 2\), where \(\mathcal {N}(\mu , \sigma )\) is a normal distribution with mean \(\mu \) and standard deviation \(\sigma \). We illustrate some constructed block towers in Fig. 5. We perform exact voxelization at a grid size of \(32\times 32\times 32\) using binvox, a 3D mesh voxelizer [31].
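
The sampling procedure can be sketched as follows; the block-size distribution is an assumption, as the text specifies only how centers are derived.

```python
# A sketch of tower sampling; d_k is the vertical extent of block k.
import numpy as np

def sample_tower(n_blocks, rng=None):
    rng = rng or np.random.default_rng()
    sizes = rng.uniform(0.1, 0.3, size=(n_blocks, 3))  # (w_k, h_k, d_k), assumed
    centers = np.zeros((n_blocks, 3))
    for k in range(1, n_blocks):
        w, h, _ = sizes[k - 1]
        centers[k, 0] = rng.normal(centers[k - 1, 0], w / 4)  # x_k
        centers[k, 1] = rng.normal(centers[k - 1, 1], h / 4)  # y_k
        # Stack block k directly on top of block k-1.
        centers[k, 2] = centers[k - 1, 2] + (sizes[k - 1, 2] + sizes[k, 2]) / 2
    return sizes, centers
```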

Fig. 5. Sample objects in our block towers dataset (left) and qualitative results of our model with different combinations of observations as input (right).

Materials. In our experiments, we use five different materials, and follow their real-world densities with minor modifications. The materials and the ranges of their densities are listed in Table 1. For each block in the block towers, we first assign it to one of the five materials, and then uniformly sample its density from possible values of its material. We generate 8 configurations for each block tower.

Textures. We obtain the textures for materials by cropping the center portion of images from the MINC dataset [4]. We show sample images rendered with material textures in Fig. 5. Since we render the textures only with respect to the material, the images rendered do not provide any information about density.

Table 1. Materials and their real-world density values (unit: \(\times 10^2 \cdot \text {kg}/\text {m}^3\)). Objects made of similar materials (e.g., different types of metals) may have different physical properties, while objects made of different materials (e.g., stone and metal) may have the same physical properties.

Physics Interactions. We place each block tower at the origin and perform four physics interactions to obtain object trajectories (\(N_\text {T} = 4\)). Specifically, we exert a force of magnitude \(10^5\) on the block tower from each of four pre-defined positions \(\{(\pm 1, -1, \pm 1)\}\). We simulate each interaction for 256 time steps using the Bullet Physics Engine [9]. To ensure simulation accuracy, we set the simulation time step to 1/300 s.
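
A minimal pybullet sketch of one such interaction (assuming a physics client is connected and the tower is already loaded as `body_id`; note that pybullet returns quaternions in (x, y, z, w) order):

```python
# Push the tower from a pre-defined position and record its pose trajectory.
import math
import pybullet as p

def record_trajectory(body_id, force_pos, magnitude=1e5, steps=256):
    p.setTimeStep(1.0 / 300.0)              # small step for simulation accuracy
    # Direct the force from the given position toward the origin.
    norm = math.sqrt(sum(c * c for c in force_pos))
    force = [-magnitude * c / norm for c in force_pos]
    # An external force acts only during the next simulation step.
    p.applyExternalForce(body_id, -1, force, force_pos, p.WORLD_FRAME)
    traj = []
    for _ in range(steps):
        p.stepSimulation()
        pos, orn = p.getBasePositionAndOrientation(body_id)
        traj.append(list(pos) + list(orn))  # 7-D pose per time step
    return traj
```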

Metrics. We evaluate shape reconstruction by the \(\text {F}_1\) score between the prediction and the ground truth: a predicted primitive is labeled a true positive if its intersection over union (IoU) with a ground-truth primitive is greater than 0.5. For physics estimation, we employ two types of metrics: (i) density measures, top-k accuracy (\(k \in \{1, 5, 10\}\)) and root-mean-square error (RMSE), and (ii) a trajectory measure, the mean absolute error (MAE) between the trajectory simulated with the predicted physical parameters and the ground-truth trajectory.
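
A sketch of the F1 computation, with `box_iou_3d` as a hypothetical helper for oriented 3D box IoU:

```python
# Match predicted primitives to ground truth at IoU > 0.5, then compute F1.
def f1_score(preds, gts, box_iou_3d, thresh=0.5):
    matched, tp = set(), 0
    for pred in preds:
        best_j, best_iou = None, thresh
        for j, gt in enumerate(gts):
            iou = box_iou_3d(pred, gt)
            if j not in matched and iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:      # this prediction is a true positive
            matched.add(best_j)
            tp += 1
    precision = tp / max(len(preds), 1)
    recall = tp / max(len(gts), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```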

Methods. We evaluate our model with different combinations of observations as input: (i) texture only (i.e., no trajectory, setting \(f_\text {T} = 0\)), (ii) physics only (i.e., no image, setting \(f_\text {I} = 0\)), (iii) both texture and physics but without the voxelized shape, (iv) both texture and physics but with the 3D trajectory replaced by a raw depth video, and (v) full data in our original setup (image, voxels, and trajectory). We also compare our model with several baselines: (i) predicting the most frequent density in the training set (Frequent), (ii) nearest neighbor retrieval from the training set (Nearest), and (iii) knowing the ground-truth material and guessing within its density value range (Oracle). While all these baselines assume perfect shape reconstruction, our model learns to decompose the shape.

Results. For shape reconstruction, our model achieves an \(\text {F}_1\) score of 97.5. For physics estimation, we present quantitative results of our model with different observations as input in Table 2. We compare our model with an oracle that infers material properties from appearance while assuming ground-truth reconstruction; it gives an upper bound on the performance of methods that rely on appearance cues alone. Experiments suggest that appearance alone is not sufficient for density estimation. From Table 2, we observe that combining appearance with physics performs well on physical parameter estimation, because object trajectories provide crucial additional information about the density distribution (e.g., the moment of inertia). Also, all input modalities and sampling contribute to the model's final performance.

Table 2. Quantitative results of physical parameter estimation on block towers. Combining appearance with physics does help our model to achieve better estimation on physical parameters, and our model performs significantly better than all other baselines.
Table 3. Comparison between our model and a physics-engine-based sampling baseline.

We have also implemented a physics-engine-based sampling baseline: sampling the shape and physical parameters for each primitive, simulating with a physics engine, and selecting the sample whose trajectory is closest to the observation. We also compare with a stronger baseline that samples only physics, assuming the ground-truth shape is known. Table 3 shows that our model works better and is more efficient: the neural networks have learned an informative prior that greatly reduces the need for sampling at test time.

5.2 Decomposing Tools

We then demonstrate the practical applicability of our model by decomposing synthetic models of real-world tools.

Tools. Because of the absence of tool data in the ShapeNet Core [8] dataset, we download tool models from 3D Warehouse and manually remove all unrelated ones. In total, there are 204 valid tools; we use Blender to remesh and clean them up, fixing issues with missing faces and normals. Following Chang et al. [8], we perform PCA on the point clouds and align the models by their PCA axes. Sample tools in our dataset are shown in Fig. 6.

Table 4. Quantitative results of physical parameter estimation on tools. Combining visual appearance with physics observations helps our model to perform much better on physical parameter estimation, and compared to all other baselines, our model performs significantly better on this dataset.

Primitives. Similar to Zou et al. [53], we first use energy-based optimization to fit primitives to the point clouds; we then assign each vertex to its nearest primitive and refine each primitive as the minimum oriented bounding box of the vertices assigned to it.

Other Setups. We make use of the same set of materials and densities as in Table 1 and the same textures for materials as described in Sect. 5.1. Sample images rendered with textures are shown in Fig. 6. As for physics interactions, we follow the same scenario configurations as in Sect. 5.1.

Training Details. Because the synthetic tools dataset is rather small, we first pre-train our PPD model on block towers and then fine-tune it on the synthetic tools. For the block towers used in pre-training, we fix the number of blocks to 2 and introduce small random noise and rotations to each block to bridge the gap between block towers and synthetic tools.

Fig. 6. Sample objects in our synthetic tools dataset (left) and qualitative results of our model with different combinations of observations as input (right).

Results. For shape reconstruction, our model achieves an \(\text {F}_1\) score of 85.9. For physics estimation, we present quantitative results in Table 4. Shape reconstruction is not as good as on the block towers dataset because the synthetic tools are more complicated, and orientations may introduce ambiguity (multiple bounding boxes with different rotations can fit the same object part). Physics estimation performance is better, since the number of primitives in our synthetic tools dataset is small (\(\le \)2 in general). We also show some qualitative results in Fig. 6.

5.3 Decomposing Real Objects

We look into real objects to evaluate the generalization ability of our model.

Real-World Block Towers. We purchase ten sets of blocks of different materials (pine, steel, aluminum, and copper) from Amazon and construct a dataset of real-world block towers. Our dataset contains 16 block towers with different configurations: 8 with two blocks, 4 with three blocks, and 4 with four blocks.

Physics Interaction. The scenario is set up as follows: the block tower is placed at a specific position on a desk, and we use a copper ball (hung from a pendulum) to hit it. In Fig. 7, we show some objects and their trajectories in our dataset.

Video to 3D Trajectory. A major challenge on real-world data is converting RGB videos into 3D object trajectories. We employ the following two-step approach:

  1. Tracking 2D Keypoints. For each frame, we first detect the 2D positions of object corners. For simplicity, we mark the corners with red stickers and use a simple color filter to determine their positions. We then find the correspondence between corner points in consecutive frames by solving a minimum-distance matching between the two point sets. After aligning the corner points across frames, we obtain 2D trajectories of these keypoints.

  2. Reconstructing 3D Poses. We annotate the 3D position of each corner point. Then, for each frame, we have the 2D locations of the keypoints and their corresponding 3D locations, and we reconstruct the 3D object pose by solving the Perspective-n-Point (PnP) problem between the 2D and 3D locations using the Levenberg-Marquardt algorithm [24, 28], as sketched below.
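
A sketch of both steps, using SciPy's assignment solver for the minimum-distance matching and OpenCV's iterative (Levenberg-Marquardt) PnP solver; the camera intrinsics `K` are assumed known from calibration.

```python
# Keypoint matching across frames and PnP pose recovery per frame.
import cv2
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_keypoints(prev_pts, curr_pts):
    """Match corner points between consecutive frames by minimum total distance."""
    cost = np.linalg.norm(prev_pts[:, None] - curr_pts[None, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return rows, cols       # prev_pts[rows[i]] corresponds to curr_pts[cols[i]]

def reconstruct_pose(points_3d, points_2d, K, dist=None):
    """Recover the object pose in one frame from 2D-3D correspondences."""
    ok, rvec, tvec = cv2.solvePnP(
        points_3d.astype(np.float64), points_2d.astype(np.float64),
        K, dist, flags=cv2.SOLVEPNP_ITERATIVE)
    assert ok, "PnP failed"
    return rvec, tvec       # rotation (Rodrigues vector) and translation
```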

Fig. 7. Objects and their physics trajectories in six sampled frames from our real-world block towers dataset. As in the last two rows, objects with similar visual appearance may have distinct physical properties, which we can only distinguish from their behaviors in physical events.

Fig. 8. Qualitative results (on real-world block towers) of our model with different combinations of observations as input.

Training Details. We build a virtual physics environment, similar to our real-world setup, in the Bullet Physics Engine [9]. We employ it to simulate physics interactions and generate a dataset of synthetic block towers to train our model.

Results. We show qualitative results of our model with different observations as input in Fig. 8. In the real-world setup, with only texture or only physics information, our model cannot effectively predict the physical parameters, because real images and object trajectories are much noisier than their synthetic counterparts; combining both observations yields much more accurate predictions. In terms of quantitative evaluation, our model (with both observations as input) achieves an RMSE of 18.7 over the whole dataset and 10.1 over the block towers with two blocks (the RMSE of random guessing is 40.8).

6 Analysis

To better understand our model, we present several analyses. The first three are conducted on synthetic block towers; the last is on our real dataset.

Learning Speed with Different Supervisions. We show the learning curves of our PPD model with different supervision in Fig. 9. The model supervised by physics observations reaches the same level of performance as the model with texture supervision using far fewer training steps (500K vs. 2M). Supervised by both observations, our PPD model preserves the learning speed of the physics-only model and further improves its performance.

Fig. 9. Learning curves with different observations as input. Our model learns much better and faster when both texture and physics supervision are available.

Preference over Possible Values. We illustrate the confusion matrices of physical parameter estimation in Fig. 10. Although our PPD model performs similarly with only texture or only physics as input, its preferences over possible values turn out to be quite different. With texture as input (Fig. 10a), it tends to guess within the possible values of the corresponding material (see Table 1), while with physics as input (Fig. 10b), it only makes errors between very close values. The information provided by the two types of inputs is therefore orthogonal (Fig. 10c).

Fig. 10. Confusion matrices of physical parameter estimation. The information provided by the two types of observations is different: (a) with texture as input, our model tends to guess within the material's possible density values (see Table 1); (b) with physics as input, our model only makes errors between close values.

Impact of Primitive Numbers. As demonstrated in Table 5, the number of blocks has nearly no influence on the model with texture as input. With physics interactions as input, the model performs much better on fewer blocks, and its performance degrades as the number of blocks increases. The degradation is probably because the physical response of a rigid body is fully characterized by a few aggregate properties (i.e., total mass, center of mass, and moment of inertia), which place only limited constraints on the density distribution when the number of primitives is relatively large.

Fig. 11. Human, model, and ground-truth predictions on “which block is heavier”. Our model performs comparably to humans, and its responses are correlated with humans'.

Table 5. Quantitative results (RMSE’s) on block towers (with different block numbers): (a) with texture as input, our model performs comparably on different block numbers; (b) with physics as input, our model performs much better on fewer blocks.

Human Studies. We select the block towers with two blocks from our real dataset and study the question of “which block is heavier” on them. The studies are conducted on Amazon Mechanical Turk. For each block tower, we provide 25 annotators with an image and a video of a physics interaction, and ask them to estimate the ratio of the mass of the upper block to that of the lower block. Instead of directly predicting a real value, annotators make a choice on a log scale, i.e., from \(\{2^k \mid k = 0, \pm 1, \ldots , \pm 4\}\). The average human predictions, our model's predictions, and the ground truth are shown in Fig. 11. Our model performs comparably to humans, and its responses are highly correlated with theirs: the Pearson correlation coefficients of “Human vs. Model”, “Human vs. Truth” and “Model vs. Truth” are 0.69, 0.71 and 0.90, respectively.

7 Conclusion

In this paper, we have formulated and studied the problem of physical primitive decomposition (PPD): approximating an object with a set of primitives that explain both its geometry and its physics. To this end, we proposed a novel formulation that takes both visual and physics observations as input. We evaluated our model in several setups: synthetic block towers, synthetic tools, and real-world objects. Our model achieved good performance on both synthetic and real data.