6DoF assembly pose estimation dataset for robotic manipulation

Robotic assembly is a challenging task that requires both cognition and dexterity. In recent years, perception tools have achieved tremendous success in endowing robots with cognitive capabilities. Although these tools have succeeded in tasks such as detection, scene segmentation, pose estimation and grasp manipulation, the associated datasets lack crucial information needed to adapt them for assembly pose estimation. Furthermore, existing datasets of object 3D meshes and point clouds are presented in non-canonical view frames and therefore lack the information needed to train perception models that infer from a visual scene. This dataset presents two simulated object assembly scenes with RGB-D images, 3D mesh files and ground-truth assembly poses, as an extension of the state-of-the-art BOP format. This enables smooth extension of existing perception models in computer vision, as well as the development of novel algorithms for estimating assembly poses in robotic assembly manipulation tasks.


Subject: Computer Science
Specific subject area: Cognitive Robotics
Type of data: Simulated, RGB image, 3D mesh, depth image
Data collection: Data were generated using the Gazebo simulator [9] with 3D mesh files of assemblies obtained from the Thingiverse database [6,7]. The images were captured through a simulated RealSense D435i camera following a hemisphere sampling procedure.

Value of the Data
• Assembling is a demanding skill in robotic manipulation, often addressed as a perception problem. We present a dataset of two assemblies simulated in a tabletop scene, with the information required for training and inference of perception-based deep learning models that can endow a robot manipulator with assembling skills.
• The existing datasets [2,5] contain only geometric information and do not accurately represent a robotic perception environment. In contrast, our dataset presents multiple view samples of a tabletop assembly scene acquired through a depth sensor, together with the relevant ground-truth information.
• The research community can easily adopt the data generation pipeline for any object assembly, without limitation to a certain category of objects (e.g. furniture, mechanical components). The dataset is formatted as an extension of the BOP format [3], the state-of-the-art benchmark for 6D object pose estimation. Furthermore, since ground-truth labels are provided, the dataset can also be used for benchmarking different assembly pose estimation techniques.

Background
The purpose of this dataset is to provide the features required to learn the spatial relationships and the assembly sequence among the objects in an assembly. In a real robotic application, these features must be extracted from the environment using sensor inputs. Contrary to 3D mesh and point cloud datasets, which define an assembly in an arbitrary coordinate frame, a depth sensor observes only a partial view of an object with respect to its own coordinate frame. Therefore, an assembly scene viewed through an RGB-D sensor mounted on a robot manipulator, as presented in this dataset, produces a more accurate representation of a robotic assembly scene. The capabilities of modern physics simulators to simulate objects and camera sensors with definable parameters were utilized to produce the simulated dataset efficiently. We trained and benchmarked 6DAPose [1], an assembly pose estimation framework for robotic assembling, on this dataset.

Data Description
The dataset consists of the object assemblies listed in Tables 1 and 2. The directory architecture of the dataset extends the BOP format. The directory level 1 (root) of the dataset is structured as described in Fig. 1.
• step_1, step_2, … directories contain the information corresponding to each step of the assembly.
• corners.pkl stores the 8 corners of the bounding box of the 3D mesh of every object (the row number corresponds to the object id).
• gt_assembly_poses.json defines the assembly pose of each object with respect to the base object (an optional file, only required in the data generation process).
• model_meshes contains the triangle mesh files of all objects in the assembly.
• model_pointcloud contains the 3D point clouds of all objects.
• model_info.json is an optional file generated using the BOP Toolkit [8] for pose error calculations.
Each assembly step directory contains N samples of simulated assembly scenes. Each assembly scene contains information in the following directory structure.
• scene_gt.json contains the ground-truth 6DoF pose labels for each object in the scene: obj_id is the object identifier, cam_R_w2c is the object rotation matrix with respect to the color optical frame (row-wise), and cam_t_w2c is the object translation vector with respect to the color optical frame.
• scene_w_gt.json contains the same information as scene_gt.json, expressed with respect to the world coordinate frame.
Both the dataset and the dataset generation scripts are publicly available. Instructions for custom data generation are hosted in the repository at https://github.com/KulunuOS/6DAPose .
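As a minimal sketch of how the ground-truth labels can be consumed (assuming the BOP-style field layout described above; the file path is illustrative), each entry of scene_gt.json can be assembled into a 4×4 homogeneous transform:

```python
import json

import numpy as np


def load_scene_poses(path):
    """Parse a BOP-style scene_gt.json into 4x4 homogeneous transforms.

    Assumes each entry carries a row-wise 3x3 rotation (cam_R_w2c, 9 values)
    and a translation (cam_t_w2c, 3 values), as described in the dataset.
    Returns {scene_id: {obj_id: 4x4 ndarray}}.
    """
    with open(path) as f:
        scene_gt = json.load(f)
    poses = {}
    for scene_id, entries in scene_gt.items():
        for entry in entries:
            T = np.eye(4)
            T[:3, :3] = np.asarray(entry["cam_R_w2c"], dtype=float).reshape(3, 3)
            T[:3, 3] = np.asarray(entry["cam_t_w2c"], dtype=float).ravel()
            poses.setdefault(scene_id, {})[entry["obj_id"]] = T
    return poses
```

The resulting matrices can be passed directly to BOP Toolkit pose-error functions or used to transform the model point clouds into the camera frame.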

Experimental Design, Materials and Methods
The assembly scenes were simulated using the Gazebo Classic physics simulator [9].

CAD file preprocessing:
The CAD files of all objects in an assembly were acquired from the open-source CAD archive thingiverse.com. The CAD files were preprocessed for compatibility with the simulation by conversion to the Polygon File Format (.ply). These files were further transformed to the Simulation Description Format (.sdf) with texture properties. In an offline process involving human input, the assembly sequence and the 6DoF assembly pose labels were annotated.
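The .ply conversion step can be sketched without any CAD library, since ASCII PLY is a simple text format; the vertex and face data below are illustrative placeholders, not taken from the dataset:

```python
def write_ply(path, vertices, faces):
    """Write a triangle mesh to an ASCII Polygon File Format (.ply) file.

    vertices: iterable of (x, y, z) tuples; faces: iterable of vertex-index tuples.
    """
    with open(path, "w") as f:
        f.write("ply\nformat ascii 1.0\n")
        f.write(f"element vertex {len(vertices)}\n")
        f.write("property float x\nproperty float y\nproperty float z\n")
        f.write(f"element face {len(faces)}\n")
        f.write("property list uchar int vertex_indices\n")
        f.write("end_header\n")
        for x, y, z in vertices:
            f.write(f"{x} {y} {z}\n")
        for face in faces:
            f.write(f"{len(face)} {' '.join(map(str, face))}\n")


# A single triangle as a minimal example mesh.
write_ply("triangle.ply", [(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(0, 1, 2)])
```

In practice a mesh-processing tool (e.g. a CAD converter or a mesh library) performs this conversion; the sketch only illustrates the target file layout.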

Assembly scene simulation:
In the order of the assembly sequence, the objects and the partially assembled objects were randomly placed on a table simulated at the origin of the simulation space. The table was 1 m high with a white surface. A single spotlight source was simulated above the assembly scene. The objects were placed in their most stable positions under gravity (Fig. 2).
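The random placement step can be sketched as sampling a planar drop pose above the table surface; the table half-extent below is a hypothetical value, not taken from the dataset:

```python
import math
import random

TABLE_HEIGHT = 1.0       # table surface height in metres, per the scene description
TABLE_HALF_EXTENT = 0.4  # hypothetical usable half-width of the table top, in metres


def sample_drop_pose(rng=random):
    """Sample a random planar drop pose (x, y, z, yaw) on the table surface.

    The physics engine then lets the object settle under gravity into its most
    stable resting position; this only chooses the initial placement.
    """
    x = rng.uniform(-TABLE_HALF_EXTENT, TABLE_HALF_EXTENT)
    y = rng.uniform(-TABLE_HALF_EXTENT, TABLE_HALF_EXTENT)
    yaw = rng.uniform(0.0, 2.0 * math.pi)
    return x, y, TABLE_HEIGHT, yaw
```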

RGBD sensor simulation:
The Gazebo simulator is integrated with the Robot Operating System (ROS Noetic) framework and simulates sensors using plugins and Unified Robot Description Format (URDF) files. We use the open-source RealSense ROS plugin implemented by pal-robotics [10] and the RealSense robot description [11]. The RealSense D435i camera has separate optical frames for the color sensor, the depth sensor and the camera body; the ROS tf library publishes the transformations between these frames, which is important for recording data accurately (Fig. 3).
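Chaining the published frame transforms amounts to composing 4×4 homogeneous matrices; the body-to-color offset below is purely illustrative, as the real values come from the tf tree published by the simulated RealSense plugin:

```python
import numpy as np


def make_T(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


# Hypothetical camera-body -> color-optical-frame offset (illustrative values only).
T_body_color = make_T(np.eye(3), [0.0, 0.015, 0.0])


def body_to_color(T_world_body, T_body_color):
    """Express the color optical frame in world coordinates by chaining transforms."""
    return T_world_body @ T_body_color
```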

Data capturing algorithm:
We control the position of the simulated RGB-D sensor and sample viewpoints from an upper hemisphere centered at the origin of the simulation space, following the hemisphere sampling algorithm of [4]. At each parametrized view sample, we record the information following Algorithm 1, described below.
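A camera position on the upper hemisphere can be computed from the yaw, pitch and scale parameters via the usual spherical-to-Cartesian conversion; the exact parameterisation of [4] may differ, so this is an illustrative sketch:

```python
import math


def hemisphere_viewpoint(phi, theta, radius):
    """Camera position on an upper hemisphere centered at the world origin.

    phi: yaw (azimuth) in radians; theta: pitch (elevation) in radians;
    radius: distance from the origin (the scale parameter). Keeping theta
    in (0, pi/2] places the camera above the table plane.
    """
    x = radius * math.cos(theta) * math.cos(phi)
    y = radius * math.cos(theta) * math.sin(phi)
    z = radius * math.sin(theta)
    return (x, y, z)
```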

Parameters:
φ: yaw angle of the camera
θ: pitch angle of the camera
s: scale of the camera

Inputs: 3D mesh models of the assembly

Procedure:
1. Define and record assembly constraints.
2. for each assembly step:
3.   for each incremental value of φ, θ, s:
4.     Record:
       i. I_RGB (color image)
       ii. I_D (depth image)
       iii. I_S (segmentation map)
       iv. P_obj (ground-truth 6DoF poses of the objects)
       v. P_cam (ground-truth 6DoF pose of the camera)
       vi. K_cam (ground-truth camera parameters)

Outputs: I_RGB, I_D, I_S, P_obj, P_cam, K_cam
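The nested capture loop can be sketched as follows; the `record` callback is a hypothetical stand-in for the simulator calls that save the color image, depth image, segmentation map, object and camera poses, and camera parameters for one viewpoint:

```python
import itertools


def capture_dataset(assembly_steps, phis, thetas, scales, record):
    """Iterate the capture procedure: one recorded sample per (step, phi, theta, s).

    `record` is a hypothetical callback standing in for the simulator calls that
    save I_RGB, I_D, I_S, P_obj, P_cam and K_cam for a single viewpoint.
    """
    samples = []
    for step in assembly_steps:
        for phi, theta, s in itertools.product(phis, thetas, scales):
            samples.append(record(step, phi, theta, s))
    return samples
```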

Limitations
Firstly, we annotate the assembly steps using human expertise rather than exhaustively checking for collisions in simulation. This is suitable when the assembly order and poses have only a single solution. In the presence of multiple correct assembly steps and pose configurations, it is preferable to follow the assembly-by-disassembly concept and check for collisions when annotating the dataset [5].
Secondly, the sim-to-real gap in the simulated data is considerable due to the limited capabilities of the Gazebo Classic simulator. This could be overcome by implementing the same procedure with Ignition Gazebo (gazebosim.org). However, realistic color images are only important when the pose estimation algorithm relies heavily on color image features.

Ethics Statement
The authors declare that this work follows the ethical requirements for publication in Data in Brief, and we confirm that the work does not involve human subjects, animal experiments, or any data collected from social media platforms.

Table 3 provides a summary of a single assembly (the Nema17 reducer assembly) from the dataset:

Table 3
Raw data sample from Nema17 assembly dataset.