1 Introduction

In the context of Industry 4.0, robotic systems must adapt to handle unconstrained pick-and-place tasks, human-robot interaction and collaboration, and autonomous robot movement. These environments and tasks depend on methods that perform object detection, localization, segmentation, and pose estimation. Accurate robotic manipulation, unconstrained pick and place, and scene understanding all require accurate object detection and pose estimation methods. These methods are also used in other contexts, such as augmented reality, where virtual objects that are poorly placed in the real world break the augmented-reality experience. Another application is the use of augmented reality in industry to train new workers, where virtual objects need to be rendered at the correct positions so that they look like real objects or simulate their correct placement.

6D pose estimation is a hard problem to tackle due to possible scene clutter, illumination variability, object truncations, and the different shapes, sizes, and textures of objects, as well as the similarities between them.

We present a new 6D pose estimation method that has low pose estimation error and can run in real-time. Our method is a complete pipeline that can detect, segment, and estimate the 6D pose of known objects present in the scene.

Some methods, like MaskedFusion [1] or DenseFusion [2], have two separate training steps: one to train the detection and segmentation neural network, and another to train the neural network responsible for estimating the objects' pose, which requires the first network to be trained beforehand. MPF6D, on the other hand, is trained as a single method. Besides having simpler and faster training, MPF6D achieves higher accuracy due to its new architecture, which leverages a pyramid neural network to use object features extracted at different resolutions.

The main contributions are:

  • A new high accuracy 6D pose estimation method;

  • A simpler pipeline for 6D pose estimation that can be trained as a single method, which facilitates the use of MPF6D with different scenarios, datasets, or types of objects;

  • A fast method that can be used in real-time, achieving 8 frames per second;

  • An experimental evaluation using the standard datasets used for assessing 6D pose estimation methods including a comparison against the current best methods in the area.

Fig. 1

Representation of the data flow through MPF6D without the optional Pose Refinement step. Note that this flow is for the training phase. At inference time, the mask is generated by our method using the semantic segmentation head

2 Related work

In this section, we present the most relevant literature related to object 6D pose estimation. First, we introduce methods for semantic segmentation, and then we present methods that estimate the object’s pose.

2.1 Semantic segmentation

The U-Net [3] is one of the most widely used semantic segmentation convolutional neural networks. It is composed of two parts: a contracting path to capture context and an expansive path that assigns classes to the pixels based on that context. SegNet [4] has a similar architecture to U-Net, where the second half has the same structure as the first half but hierarchically symmetric. SegNet uses a fully convolutional neural network based on the VGG-16 [5] convolutional layers. The FastFCN [6] architecture uses a Joint Pyramid Upsampling module instead of dilated convolutions to assign classes to the pixels, since dilated convolutions consume more memory and time during training. It has a fully convolutional network as its backbone, and the Joint Pyramid Upsampling module is used to upsample the low-resolution feature maps and label the pixels. The Pyramid Scene Parsing Network (PSPNet) [7] uses global contextual information for semantic segmentation. The authors introduced a Pyramid Pooling Module after the last layers of a fully convolutional neural network based on ResNet-18; the feature maps obtained from the fully convolutional network are pooled at four different scales corresponding to four pyramid levels. The pooled feature maps are then convolved with a 1\(\times \)1 convolution layer to reduce their dimension. The outputs of each convolution are then upsampled and concatenated with the initial feature maps extracted from the fully convolutional network. This concatenation provides the local and global contextual information of the image. After the concatenation, another convolution layer generates the final pixel-wise predictions. The objective of PSPNet is to observe the whole feature map in sub-regions at different locations using the pyramid pooling module. The Gated-SCNN [8] architecture consists of a two-stream convolutional neural network: the first stream processes image shape information, and the second processes boundary information. A gating mechanism in the intermediate layers of each branch connects features from both branches. It fuses all the features from both branches and predicts the semantic segmentation masks. DeepLab models [9] address the challenge of segmenting objects at multiple scales by using Atrous convolutions and Atrous Spatial Pyramid Pooling modules. The DeepLab architecture has evolved over several generations: DeepLabV1 uses Atrous Convolution and a Fully Connected Conditional Random Field to control the resolution at which image features are computed. DeepLabV2 uses Atrous Spatial Pyramid Pooling to consider objects at different scales and segment them with improved accuracy. Finally, DeepLabV3, apart from using Atrous Convolution, also uses an improved Atrous Spatial Pyramid Pooling module that includes batch normalization and image-level features, and it does not use the Conditional Random Field of the previous versions.
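
To make the pyramid pooling idea concrete, the following PyTorch sketch is an illustrative simplification (not the PSPNet reference code); the channel split and the four bin sizes are assumptions. It pools a feature map at several scales, reduces each pooled map with a 1\(\times \)1 convolution, upsamples the results, and concatenates them with the original features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Simplified PSPNet-style pyramid pooling module (illustrative only)."""

    def __init__(self, in_channels, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        out_channels = in_channels // len(bin_sizes)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                  # pool to b x b bins
                nn.Conv2d(in_channels, out_channels, 1),  # 1x1 conv reduces channels
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for b in bin_sizes
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        # concatenate the global context with the original (local) feature map
        return torch.cat([x] + pooled, dim=1)
```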

Fig. 2

In-depth representation of Pyramid ResNet34. This system is used with both RGB and Mask images to extract features, with only one change between them: the first layer for the mask receives only one channel instead of three

2.2 6D pose estimation

The methods for 6D pose estimation can be divided into three categories based on the type of input they use: RGB, point cloud, and RGB-D methods.

Methods like [10,11,12,13,14] that use RGB images as input rely on detecting keypoints of the objects and matching them with keypoints from the object's 3D render, and then use the PnP [15] algorithm to estimate the 6D pose. In this category, there are also methods, such as [16], that perform template matching with cropped patches of the object in the image and approximate the 3D model of the object to the cropped patches to estimate its pose.
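
For context, RGB-only pipelines of this kind typically end with a PnP solve over 2D-3D correspondences. The minimal OpenCV sketch below is illustrative only; the keypoint correspondences and camera intrinsics are placeholder values, not taken from any of the cited methods.

```python
import cv2
import numpy as np

# Hypothetical 2D keypoint detections and their matched 3D model points (metres).
points_2d = np.array([[320.0, 240.0], [350.0, 260.0], [300.0, 280.0], [330.0, 300.0]])
points_3d = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0],
                      [0.0, 0.05, 0.0], [0.0, 0.0, 0.05]])
K = np.array([[572.4, 0.0, 325.3],   # assumed camera intrinsics
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]])

ok, rvec, tvec = cv2.solvePnP(points_3d, points_2d, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)   # rotation matrix of the estimated 6D pose
```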

Point cloud methods [17,18,19,20,21] rely on descriptors to extract object features and match the extracted features with features acquired from known poses.

RGB-D methods [2, 16, 22,23,24,25,26] regress the 6D poses directly from the input data. Usually, these methods have a pose refinement phase using the Iterative Closest Point (ICP) algorithm, in which the depth data is mostly taken into account. Within this category, a smaller group of methods [1, 2, 26] fuses RGB and depth input data to achieve better pose accuracy and also uses refinement phases to reach higher accuracy. Tejani et al. [27] follow a local approach in which small RGB-D patches vote for object pose hypotheses in a 6D space. Kehl et al. [16] also follow a local approach, but they use a convolutional auto-encoder (CAE) to encode each patch of the object, later match the features obtained at the bottleneck of the CAE against a code-book of features learned during training, and use the code-book matches to predict the 6D pose of the object. Although such methods do not take global context into account, they have proven robust to occlusion and noise artifacts, since they infer the object pose using only small patches of the image. SSD-6D [22] processes an RGB image with a neural network to output localized 2D detections with bounding boxes, classifies the bounding boxes into discrete bins of Euler angles, and subsequently estimates the object's 6D pose. This method falls in the RGB-D category because, after the first estimation and given the availability of depth information, the 6D poses can be further refined. PoseCNN [24] uses a new loss function that is robust to object symmetry to directly regress the object rotation, and a Hough voting approach to obtain the 3D center of the object and estimate its translation. Using ICP in the refinement phase of SSD-6D and PoseCNN makes their 6D pose estimation more accurate. Li et al. [23] formulate a discriminative representation of 6D pose that enables the prediction of both rotation and translation with a single forward pass of a convolutional neural network, and it can be used with many object categories. DenseFusion [2] extracts features from RGB images and depth data with different neural networks. After the extraction, it fuses the depth and RGB features while retaining the geometric structure of the input space. DenseFusion is similar to PointFusion [25], as it also estimates the 6D pose while keeping the geometric structure and appearance information of the object, later fusing this information in a heterogeneous architecture. MaskedFusion [1] is a pipeline divided into three sub-tasks that, combined, solve the task of object 6D pose estimation. In the first sub-task, each object in the scene is detected and segmented: the neural network classifies each pixel of the captured RGB image and predicts the mask and location of each object. Filters are then applied to the mask, and a bit-wise AND operation is used on the original RGB and depth images to crop the intended object. In the second sub-task, with the masks obtained from the first sub-task and the RGB-D data, the object's 6D pose is estimated. For each type of input data, the method has a different neural network to extract features; after extraction, the features are combined and another neural network extracts the most meaningful ones and regresses the estimated 6D pose. After this sub-task the 6D pose is estimated, but an additional neural network can optionally refine it.
PVN3D [26] uses DenseFusion as a backbone to extract object features and then uses a shared MLP to estimate the object keypoints and segment each object. A clustering algorithm is then used to find the predicted keypoints of the object, and, in the end, a least-squares fitting algorithm estimates the 6D pose of the object.

Most related methods rely on an object detector or object segmentation method to pre-process the scene and crop each object before estimating its 6D pose. Our method, MPF6D, uses RGB-D data and extracts features from both the RGB and depth images, so it falls in the RGB-D category. MPF6D does not need a pre-processing method to detect the object: it segments the objects in one neural network head while other heads estimate the object's translation and rotation.

Fig. 3

Pyramid Fusion is the architecture that fuses extracted features from the different data types (RGB, Mask, Depth/Point Cloud)

3 MPF6D

Our method draws on pyramid neural networks: we use this type of architecture during feature extraction and feature fusion so that we can combine low-resolution, semantically strong features with high-resolution features. With this type of architecture, it was possible to create an accurate and fast method to estimate the 6D pose of objects.

Our architecture has five steps from the input data to the estimated pose: Semantic Segmentation, Feature Extraction, Feature Fusion, Pose Estimation Heads, and Pose Refinement. All five steps are trained simultaneously.

In Fig. 1, we show the overview diagram that represents the data flow through MPF6D, except for the optional Pose Refinement step. The flow starts at the top of the figure with the RGB, Mask, and Depth images as input. Note that this flow is for the training phase; at inference time, the mask is generated by the semantic segmentation head. Figure 2 represents the pyramid ResNet34 scheme (we named it Pyramid ResNet34) as well as the upscaling processes used to obtain features at different scales in a pixel-wise form. In Fig. 3, we show the fusion of the multiple extracted features that are used by the estimation heads (Translation Head, Rotation Head).

In the first step, Semantic Segmentation, we use the DeepLabV3 [9] architecture as a head of our method to detect, classify, and generate the mask of the known objects present in the scene. This head is trained with the same loss function proposed in [9], and its training happens at the same time as that of the rest of the method, which is possible because the head is integrated into our neural network. After the mask is predicted, we apply a technique introduced by the authors of MaskedFusion [1]: first, we smooth the mask image with a median filter with a \(3\times 3\) kernel, and then we dilate the mask with a \(5\times 5\) kernel so that, if the mask has minor boundary segmentation errors, this operation helps to correct them or complete any misclassified pixels in the middle of the object. Since the 6D pose estimation step of our neural network requires an isolated object as input, we use the mask of the classified object to crop that object. For that, we added a mechanism that enables the rest of the neural network to be trained efficiently instead of waiting for the segmentation head to become accurate. This mechanism is only used in the initial training epochs: it consists of using the ground truth mask in the 6D pose estimation step until the semantic segmentation head reaches a low and stable loss. We use the ground truth masks until a threshold mechanism detects that the loss of the segmentation head is stabilizing; from then on, the masks produced by the segmentation head are used in the next steps. The loss threshold is applied and analyzed on the validation subset. We use a bit-wise AND between the mask and the RGB image and between the mask and the depth image so that only the pixels belonging to the object remain, and we then take a rectangular crop of the object from the resulting images. This gives us a tensor that is smaller than the full input image, in which most pixels would be black due to the bit-wise AND operation that removes the background.
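
The mask post-processing and cropping described above can be summarized in a few lines. The OpenCV/NumPy sketch below is a minimal illustration under assumed variable names (rgb, depth, mask), not our exact implementation; it follows the median filter, dilation, bit-wise AND, and rectangular crop in that order.

```python
import cv2
import numpy as np

def crop_object(rgb, depth, mask):
    """Smooth/dilate the predicted mask and crop the object from RGB and depth."""
    mask = cv2.medianBlur(mask.astype(np.uint8), 3)        # 3x3 median filter
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8))     # 5x5 dilation
    obj_rgb = cv2.bitwise_and(rgb, rgb, mask=mask)         # keep only object pixels
    obj_depth = cv2.bitwise_and(depth, depth, mask=mask)
    ys, xs = np.nonzero(mask)                              # tight rectangular crop
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    return obj_rgb[y0:y1, x0:x1], obj_depth[y0:y1, x0:x1], mask[y0:y1, x0:x1]
```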

For the Feature Extraction step on the RGB and mask data, we use a ResNet34 in a pyramid-like architecture (Fig. 2) in which two of the original ResNet34 architectures are stacked to obtain multi-scale features. This ResNet34 ends with upscale layers similar to the ones in PSPNet [7]: convolutional layers mixed with upsampling layers that let us assign features to each original pixel of the object (pixel-wise features). The features produced by the first ResNet34 are used as input to the second ResNet34, and the features are upscaled to the original object size after both the first and the second ResNet34; both upscaled feature maps are then fused. The fusion is a concatenation of the feature tensors followed by two convolutional layers. These fused features for the RGB data and the mask are later used in the pyramid feature fusion.
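
A condensed PyTorch sketch of this stacked-backbone idea is given below. It is an illustrative approximation, not our exact Pyramid ResNet34: the torchvision trunks, channel counts, and fusion layers are assumptions. The same module can be instantiated with in_channels=1 for the mask branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet34

def trunk(in_channels):
    """ResNet34 without its classification head, adapted to `in_channels` inputs."""
    net = resnet34()
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                         net.layer1, net.layer2, net.layer3, net.layer4)

class PyramidResNet34Sketch(nn.Module):
    """Two stacked ResNet34 trunks; features from both levels are upscaled and fused."""

    def __init__(self, in_channels=3, out_channels=128):
        super().__init__()
        self.level1 = trunk(in_channels)   # first trunk: image -> 512-channel features
        self.level2 = trunk(512)           # second trunk: refines the first-level features
        self.fuse = nn.Sequential(         # two convolutions fuse the upscaled features
            nn.Conv2d(1024, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, out_channels, 1),
        )

    def forward(self, x):
        size = x.shape[2:]                 # original crop resolution
        f1 = self.level1(x)
        f2 = self.level2(f1)
        up1 = F.interpolate(f1, size=size, mode="bilinear", align_corners=False)
        up2 = F.interpolate(f2, size=size, mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([up1, up2], dim=1))  # pixel-wise multi-scale features
```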

For the depth data, we convert the cropped depth image into a point cloud, and then we use a PointNet architecture to extract features from the generated point cloud.
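
Converting the cropped depth image into a point cloud only requires the camera intrinsics. The NumPy sketch below is a minimal illustration; the intrinsic values and depth scale in the commented example are assumptions, and the depth image is assumed to be indexed in the original image coordinates so that the principal point remains valid.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project a depth image (in native units) to an N x 3 point cloud in metres."""
    v, u = np.nonzero(depth)              # pixel coordinates with valid depth
    z = depth[v, u] / depth_scale         # depth in metres
    x = (u - cx) * z / fx                 # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Example call with assumed intrinsics:
# cloud = depth_to_pointcloud(cropped_depth, fx=572.4, fy=573.6, cx=325.3, cy=242.0)
```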

The neural networks responsible for feature extraction generate features for each data type. For the feature fusion step (Fig. 3), we use a pyramid-like architecture with the intent of having multi-dimensional features at different scales that are then upscaled to the original object image size. The pyramid feature fusion has three different resolutions. With this technique, which was previously used during the feature extraction of RGB and mask data, we retain the most significant features of the object while keeping the original features that represent the object's size, geometry, and pixel-wise multi-features. With all the features fused, we use the neural network heads to estimate the values of the object's position and rotation. We use two regression heads: one estimates the translation vector of the object, and the other estimates the quaternion corresponding to its rotation, so the output of these heads is a translation vector and a quaternion for the known object. After having this preliminary 6D pose, other methods (ICP or the DenseFusion refinement) could be used to refine the estimation. However, we improved the DenseFusion refinement neural network and use it in our refinement step; we chose to improve upon DenseFusion since its refinement network can be used at inference time without much computational cost. We added two extra layers following the same pyramid architecture used before, to maintain the original scale of the features and obtain deeper features.
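
The two estimation heads are plain regressors over the fused pixel-wise features. The PyTorch sketch below is only an illustration of such heads; the channel sizes, the per-point averaging, and the module name are assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseHeads(nn.Module):
    """Translation and rotation (quaternion) regression heads over fused features."""

    def __init__(self, feat_channels=512):
        super().__init__()
        self.trans_head = nn.Sequential(
            nn.Conv1d(feat_channels, 256, 1), nn.ReLU(inplace=True),
            nn.Conv1d(256, 3, 1),             # per-point translation estimate (x, y, z)
        )
        self.rot_head = nn.Sequential(
            nn.Conv1d(feat_channels, 256, 1), nn.ReLU(inplace=True),
            nn.Conv1d(256, 4, 1),             # per-point quaternion estimate (w, x, y, z)
        )

    def forward(self, fused):                 # fused: (B, C, N) pixel-wise features
        t = self.trans_head(fused).mean(dim=2)   # (B, 3) translation vector
        q = self.rot_head(fused).mean(dim=2)     # (B, 4) quaternion
        q = F.normalize(q, dim=1)                # enforce a unit quaternion
        return t, q
```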

To train MPF6D, we use the loss function (1), which computes, over M randomly sampled points, the error between the points transformed by the estimated pose and by the ground truth object pose:

$$\begin{aligned} \mathcal {L}^{p}_{i} = \frac{1}{M} \sum _{j} \left\| ( Rx_j + t) - (\hat{R}_i x_j + \hat{t}_i) \right\| \end{aligned}$$
(1)

where \(x_j\) denotes the \(j^{th}\) point of the M 3D points randomly sampled from the object's 3D model, \(p = [R\vert t]\) is the ground truth pose, R is the rotation matrix of the object, and t is the translation vector. The pose estimated by MPF6D is represented by \(\hat{p}_i = [ \hat{R}_i\vert \hat{t}_i]\), where \(\hat{R}_i\) denotes the predicted rotation and \(\hat{t}_i\) the predicted translation.
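
A direct PyTorch translation of (1) is sketched below for a single object, assuming the predicted quaternion has already been converted to a rotation matrix; the function and argument names are placeholders.

```python
import torch

def pose_loss(model_points, R_gt, t_gt, R_pred, t_pred):
    """Eq. (1): mean distance between model points under the GT and predicted poses.

    model_points: (M, 3) randomly sampled 3D points of the object model
    R_gt, R_pred: (3, 3) rotation matrices; t_gt, t_pred: (3,) translation vectors
    """
    gt = model_points @ R_gt.T + t_gt        # points transformed by the ground truth pose
    pred = model_points @ R_pred.T + t_pred  # points transformed by the predicted pose
    return torch.norm(gt - pred, dim=1).mean()
```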

All these techniques together enable an accurate 6D pose estimation while keeping the inference time as low as possible, allowing our method to be used in real-world applications. The pipeline outputs the translation vector and a rotation matrix for the known object.

Table 1 Quantitative evaluation of 6D pose using the ADD (2) metric on the LineMOD dataset. Symmetric objects are presented in italic and were evaluated using ADD-S (3)

4 Experiments

We conducted a series of experiments to assess the effectiveness of the MPF6D algorithm in addressing challenges in two datasets. To ensure reliable results, we trained the MPF6D algorithm (code available on https://github.com/kroglice/MPF6D) three times from scratch, starting with random weights each time. We then evaluated the performance of the algorithm by comparing the average results obtained from the three runs and the best results obtained from a single run with previous methods.

The purpose of training the algorithm multiple times was to account for any variability that may occur during the training process and obtain an average performance measure. By presenting both average and best results, we aimed to provide a comprehensive evaluation of the algorithm’s effectiveness. Based on the results, the MPF6D algorithm demonstrated superior performance compared to previous methods in addressing the challenges posed by the two datasets.

For all training and inference experiments, we used a desktop computer with an NVMe SSD, 64 GB of RAM, an NVIDIA GeForce GTX 1080 Ti, and an Intel Core i7-7700K CPU.

4.1 Datasets

To evaluate our method, we use two benchmark datasets, LineMOD and YCB-Video. These datasets are widely used by previous state-of-the-art methods. The LineMOD Dataset [28] was captured with a Kinect, and it has the RGB and depth images automatically aligned. The dataset consists of 13 low-textured objects, annotated 6D poses, and object masks. The main challenges of this dataset are the cluttered scenes, texture-less objects, and illumination variations.

The YCB-Video Dataset [29] contains 21 YCB objects of varying shape and texture, and is composed of 92 RGB-D videos, each with a subset of the objects placed in the scene. It has 6D pose annotations and object masks. The varying lighting conditions, image noise, and occlusions make this dataset a challenge.

4.2 Evaluation metrics

As in previous works [2, 14, 22, 24], for the LineMOD dataset we used the Average Distance of Model Points (ADD) (2) [28] as the evaluation metric for non-symmetric objects, and for the egg-box and glue we used the Average Closest Point Distance (ADD-S) (3) [24].

$$\begin{aligned} \text {ADD} = \frac{1}{m} \sum _{x \in M} \left\| ( Rx + t) - (\hat{R}x + \hat{t}) \right\| \end{aligned}$$
(2)
$$\begin{aligned} \text {ADD-S} = \frac{1}{m} \sum _{x_1 \in M} \min _{x_2 \in M} \left\| ( Rx_1 + t) - (\hat{R}x_2 + \hat{t}) \right\| \end{aligned}$$
(3)

In metrics (2) and (3), given the ground truth rotation R and translation t and the estimated rotation \(\hat{R}\) and translation \(\hat{t}\), the average distance is the mean of the pairwise distances between the 3D model points transformed by the ground truth pose and by the estimated pose. M represents the set of 3D model points and m is the number of points. For symmetric objects, the matching between points is ambiguous for some poses, which is why the ADD-S metric is used for them.
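
Both metrics are straightforward to compute from the transformed model points. The NumPy sketch below mirrors (2) and (3) under the same notation; the function names are placeholders.

```python
import numpy as np

def add_metric(model_points, R_gt, t_gt, R_pred, t_pred):
    """ADD (Eq. 2): mean distance between corresponding transformed model points."""
    gt = model_points @ R_gt.T + t_gt
    pred = model_points @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s_metric(model_points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S (Eq. 3): mean closest-point distance, used for symmetric objects."""
    gt = model_points @ R_gt.T + t_gt
    pred = model_points @ R_pred.T + t_pred
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # pairwise distances
    return dists.min(axis=1).mean()   # closest predicted point for each GT point
```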

For the YCB-Video evaluation, we use the same metrics as previous works [24] and [2], so the evaluation is done with the area under the ADD-S (3) curve (AUC). Using these common metrics enables a direct comparison between our method and previous ones.

4.3 Results: LineMOD

As previously stated, the main challenges of this dataset are the cluttered scenes, texture-less objects, and illumination variations. Even under these challenging conditions, our method had lower overall pose error than all previous methods.

As presented in Table 1, in a direct comparison between our method and PVN3D, we improve by 0.3%. This may not seem like a large improvement, but since our method, PVN3D, and even MaskedFusion are already close to zero error, even slight improvements are hard to obtain. Extracting features at different resolutions enables a more accurate pose estimation, independent of the camera angle and object distance. MPF6D achieved the best accuracy on 9 of the 13 objects in its best run. The values presented in the second column of Table 1 correspond to the best run; each run was trained from scratch with all weights initialized randomly. The first column contains the average values and the standard deviation over the three runs. Even the average of our three repetitions yields better results than any of the competing approaches.

4.4 Results: YCB-Video

In Table 2, we present the quantitative evaluation using the area under the ADD-S (3) curve (AUC). Our method outperforms all previous methods; compared with PVN3D, we obtain 1.96% more area under the curve. On the YCB-Video dataset, our method had the best accuracy on 19 of the 21 objects. The two objects on which PVN3D is better are essentially the same object (a clamp) in two different sizes. As with LineMOD, the average of our three repetitions is better than the result of any of the other methods.

Table 2 Quantitative evaluation of 6D pose (area under the ADD-S (3) curve (AUC)) on the YCB-Video Dataset

Figure 4 shows two examples of poor object pose estimation on the left and two examples of good performance on the right. The two right examples also show MPF6D handling object occlusions without losing accuracy.

Fig. 4

MPF6D object pose estimation examples. The green dots represent keypoints of the object pose estimation projected onto the RGB image. The top row contains the input RGB images and the bottom row the predicted object poses. The left two columns contain two examples of poor performance and the right two columns of good performance under heavy occlusion

4.5 Inference

MPF6D can be used in real-time applications since it infers the 6D pose of an object in 0.12 seconds, that is, it runs at 8 frames per second. This time was measured from the instant the RGB-D data is fed to the method until it produces the 6D pose estimation (a translation vector and a quaternion representing the rotation). An extra 0.02 seconds is needed to output the pose as a translation vector and a rotation matrix. PVN3D only reports the inference time on the LineMOD dataset, and on YCB-Video its authors report results obtained with ICP refinement of the 6D pose. The ICP algorithm improves the overall 6D pose estimation but usually has a high computational cost: PoseCNN, for instance, used ICP refinement and spent 10.4 seconds on the refinement alone.
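
The extra conversion from the predicted quaternion to a rotation matrix is a closed-form expression. The sketch below is illustrative, assuming a normalized quaternion in (w, x, y, z) order.

```python
import numpy as np

def quaternion_to_matrix(q):
    """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
```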

In Table 3, we present a quantitative comparison of the inference times (in seconds) of the methods that reported them. The fastest method is DenseFusion, but in terms of accuracy the two best methods are ours and PVN3D; comparing these two, our inference is 0.05 seconds faster than PVN3D, even when using pose refinement.

Table 3 Quantitative inference time

5 Ablation studies

For the ablation studies, we performed three extra experiments (see Fig. 5) on both datasets (LineMOD and YCB-Video). For these experiments, we trained our method for 50 epochs on the LineMOD and YCB-Video datasets and evaluated the inference output on the test subset of each dataset. The results of the experiments on the LineMOD dataset are shown in Table 4, and the results for YCB-Video are shown in Table 5.

Fig. 5

Representation of the ablation process used in the ablation studies

Table 4 Ablation studies (using the ADD (2)) on the LineMOD dataset

The first ablation study tests the impact of removing the pyramid from the feature extraction step (Pyramid ResNet34), meaning that we only use a single ResNet34 and one upsample layer and then proceed to the fusion layers of the neural network. With this experiment, it is possible to analyze the influence of our pyramid architecture on the MPF6D backbone feature extraction for the mask and RGB data. Without the Pyramid ResNet34, the method had a 15% higher error rate than the original MPF6D architecture, which means that extracting features at multiple resolutions improves the object 6D pose estimation. The second ablation study focuses on the impact of the Pyramid Fusion depth: it removes one depth level of the Pyramid Fusion, letting us study whether a lower depth could achieve the same results. The obtained results had around 10% more error overall. The third ablation study evaluates the impact of removing the Pyramid Fusion from the original architecture: we replaced it with a simple concatenation of the multiple features, a convolution layer, and a ReLU activation function. This third experiment showed that fusing multiple feature resolutions without the pyramid approach increases the overall error by 15%.
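
For reference, the simple fusion used in the third ablation amounts to a concatenation followed by a single convolution and ReLU. The sketch below illustrates this baseline; the channel counts and module name are assumptions.

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Ablation 3 baseline: plain concatenation + convolution + ReLU (no Pyramid Fusion)."""

    def __init__(self, rgb_ch=128, mask_ch=128, geo_ch=128, out_ch=512):
        super().__init__()
        self.conv = nn.Conv1d(rgb_ch + mask_ch + geo_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, rgb_feat, mask_feat, geo_feat):   # each: (B, C, N)
        fused = torch.cat([rgb_feat, mask_feat, geo_feat], dim=1)
        return self.relu(self.conv(fused))
```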

Table 5 Ablation studies (using area under the ADD-S (3) curve (AUC)) on the YCB-Video dataset

6 Conclusion

We propose a method consisting of a single feed-forward network that performs the complete inference from input data to 6D pose estimation. Our method can be used in real-time, taking only 0.12 seconds to retrieve an accurate 6D pose estimate of a known object present in the scene. It has the best overall performance on both datasets used (LineMOD and YCB-Video). On the LineMOD dataset, we achieved 99.7% accuracy, 0.3% better than the second-best method, PVN3D. On the YCB-Video dataset, we achieved 98.06% area under the ADD-S curve, which is 1.96% better than PVN3D. We performed ablation studies to clarify the impact of the three main architecture components on the overall error rates and found that removing these components causes a similar error increase on both datasets, indicating that their benefits do not depend on the particular data used. In the future, the developed pipeline will be integrated into a robotic setting performing unconstrained pick-and-place tasks.