FUSION OF INFORMATION FROM MULTIPLE KINECT SENSORS FOR 3D OBJECT RECONSTRUCTION

In this paper, we estimate the accuracy of 3D object reconstruction using multiple Kinect sensors. First, we discuss the calibration of multiple Kinect sensors and provide an analysis of the accuracy and resolution of the depth data. Next, the precision of the coordinate mapping between sensor data, used for registration of depth and color images, is evaluated. We test a proposed system for 3D object reconstruction with four Kinect V2 sensors and present reconstruction accuracy results. Experiments and computer simulations are carried out using Matlab and Kinect V2.


Related work
This section contains information about feature representations for depth, multi-camera approaches, and available datasets obtained with (consumer) depth cameras.
In order to fill small holes and to eliminate noise, median and binomial filters were used [18-23]. Moreover, the use of color information in the point correspondence process avoids false positive matches and, therefore, leads to a more reliable registration. Note that by adjusting the iterative closest point (ICP) algorithm and the reconstruction parameters, it is possible to improve the registration and reveal details that were invisible with a single scan due to the sensor's limited precision. Finally, it was shown [24] that smooth 3D surfaces of objects can be reconstructed using low precision sensors such as Kinect.
In [15], the authors generate point clouds from the depth information of multiple registered cameras and describe them with the VFH descriptor. For color images, they employ the DPM and combine both approaches with a simple voting scheme across multiple cameras.
A new method for visualizing occluded objects using two Kinect sensors at different locations was proposed in [25].
The interference problem of multiple Kinect cameras dramatically degrades the depth quality. In [26], an algorithm for interference cancellation in systems with multiple Kinect cameras was proposed. This algorithm exploits statistical properties of the depth map, propagates reliable gradients from interference-free regions to interfered regions, and derives depth values from the completed gradient map under a least-error criterion.
A novel approach to combining data from multiple low-cost sensors to detect people from a mobile robot was proposed in [27]. This work is based on the fusion of information from a Kinect and a thermal sensor (thermopile) mounted on top of a mobile platform.
In [28], the authors proposed a human action recognition system using multiple Kinect sensors based on multiview skeleton integration. In [13], Kinect Fusion was designed for real-time 3D reconstruction of a scene using a single Kinect sensor. The system was applied to visual navigation of a robotic vehicle when no external reference such as GPS is available.
A mirror-movement rehabilitation therapy system for hemiplegic patients was proposed in [29]. This system is based on two Kinects to eliminate problems such as limb occlusion and data loss in a single Kinect, using the Bursa seven-parameter model and the RLS method for coordinate transformation. Two networked Kinect sensors are used for real-time rigid body head motion tracking for brain PET/CT in [30]; multiple Kinect fusion allows head motion tracking even when partial or complete occlusions of the face occur. To increase the accuracy of human joint position estimation for rehabilitation and physiotherapy, information fusion from multiple Kinects was used in [31]. It was shown that the most significant improvement is achieved with two Kinects, and further increasing the number of sensors brings no significant gain.
A system for live 3D reconstruction using multiple Kinect sensors is presented in [16]. The paper describes a general design of the system architecture, a method for estimating the camera poses in 3D space, and the problem of interference between multiple Kinect V2 devices. The following aspects of 3D reconstruction using multiple Kinects are improved: automated markerless calibration, noise removal, tessellation of the output point cloud, and texturing.
One of the most important problems is the calibration of multiple Kinects. In [32], an accurate and efficient calibration method for multiple Kinects was proposed that overlaps joint regions among the Kinects and extra RGB cameras, so that a sufficient number of corresponding points between color images is available to estimate the camera parameters. The parameters are obtained by minimizing both the errors of corresponding points between color images and the errors of range data from planar regions of the environment.
A method to calibrate multiple Kinect sensors was developed in [33]. The method requires either at least three acquisitions of a 3D object from each camera or a single acquisition of a 2D object, together with a point cloud from each Kinect obtained with its built-in coordinate mapping capabilities. The method consists of the following steps: image acquisition, pre-calibration, point cloud matching, intrinsic parameter initialization, and final calibration.
Different methods to calibrate a 3D scanning system consisting of multiple Kinect sensors were investigated in [34]. A sphere, a checkerboard, and a cube were considered as calibration objects; the cube proved the most suitable for this application.
A novel method for the simultaneous calibration of the relative poses of a Kinect and three external cameras, by optimizing a cost function and assigning corresponding weights to the external cameras at different locations, is proposed in [35]. This enables efficient joint calibration of multiple devices.
A real-time 3D reconstruction method that extends the limited field of view of the Kinect depth map by using registration data of color images to construct depth and color panoramas is proposed in [36]. An efficient anisotropic diffusion method is also proposed to recover invalid depth data in the depth maps from the Kinect sensors.
A calibration procedure that uses the Kinect's coordinate mapping to extract registered color, depth, and camera space data during the acquisition step was proposed in [17]. Using three acquisitions from each camera, the calibration procedure is capable of obtaining the intrinsic and extrinsic parameters for each camera. A method for point cloud fusion after calibrating the cameras is also suggested in [17].
RGB-D datasets for different applications, including object reconstruction and 3D simultaneous localization and mapping (SLAM), were proposed in [10]. We choose the TUM benchmark dataset for evaluating visual odometry and visual SLAM [37].

The proposed system
This section describes the proposed system for 3D object reconstruction based on the fusion of information from multiple Kinect sensors.
The proposed system consists of the following steps: data acquisition, calibration, and point cloud fusion.
Step 1: data acquisition
The acquisition step obtains two types of data from each Kinect: a point cloud PC_i from each sensor, and the 2D projections of the point cloud on the depth and color cameras. We represent depth information as point clouds, i.e., as sets of points with 3D coordinates. This allows easy aggregation of the depth information available from multiple registered camera views. We retain the RGB color and depth images as well as the 3D point clouds for each camera, and the registered multiview point clouds computed from the depth images. Note that the depth data provided by Kinect is noisy and incomplete: noisy measurements vary across two to four discrete depth levels. We therefore smooth the depth data by averaging over 9 depth frames, and recover incomplete regions with median filtering. The interference problem of multiple Kinect cameras was solved even for large angles between Kinects.
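The temporal averaging over nine depth frames and the median-based hole filling can be sketched as follows (an illustrative Python/NumPy sketch rather than the paper's MATLAB implementation; the function name and the 3×3 median kernel are assumptions):

```python
import numpy as np

def smooth_depth(frames, median_kernel=3):
    """Temporally smooth Kinect depth frames, then median-filter to
    fill small invalid (zero-depth) regions.

    frames: list of 2D arrays (depth in millimeters), e.g. 9 frames.
    """
    stack = np.stack(frames).astype(np.float64)
    # Treat zeros as missing: average only over valid samples per pixel.
    valid = stack > 0
    counts = valid.sum(axis=0)
    summed = np.where(valid, stack, 0.0).sum(axis=0)
    avg = np.where(counts > 0, summed / np.maximum(counts, 1), 0.0)

    # Median filter to fill remaining holes and suppress speckle noise.
    pad = median_kernel // 2
    padded = np.pad(avg, pad, mode="edge")
    out = np.empty_like(avg)
    h, w = avg.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(
                padded[i:i + median_kernel, j:j + median_kernel])
    return out
```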
Camera space refers to the 3D coordinate system used by Kinect, where the coordinate system is defined as follows [17]: the origin is located at the center of the infrared sensor on the Kinect; the positive X direction goes to the sensor's left; the positive Y direction goes up; the positive Z goes out in the direction the sensor is facing; and the units are in meters.
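Given the depth camera intrinsics, a depth pixel can be back-projected into this camera space. A minimal Python sketch following the conventions above (+Y up, +Z out of the sensor, units in meters); the function name and parameterization are assumptions:

```python
import numpy as np

def depth_to_camera_space(depth_mm, fx, fy, cx, cy):
    """Back-project a depth image into Kinect camera space.

    Returns an (H, W, 3) array of [X, Y, Z] in meters: origin at the
    IR sensor, +Y up, +Z out in the direction the sensor is facing.
    """
    h, w = depth_mm.shape
    z = depth_mm.astype(np.float64) / 1000.0   # millimeters -> meters
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * z / fx
    y = (cy - v) * z / fy                      # image v grows down; camera +Y is up
    return np.dstack([x, y, z])
```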
The data acquisition procedure follows these steps [17]: 1. On each sensor, use the color camera to capture images synchronously in order to detect the colored markers. For this, we use small color blobs that satisfy certain constraints: on the 1D calibration pattern, the color points must lie on a line of fixed length; on the 2D pattern, the color points must be separated by the fixed distances given by the pattern. The red, green, blue, and yellow markers are defined as a, b, c, and d, respectively. All cameras must find the three or four markers for a frame to count as valid. 2. Map the coordinates of a, b, c, and d from color space to camera space with the coordinate mapping, obtaining the camera space points A_i, B_i, C_i, and D_i.

Step 2: calibration
Our pose estimation procedure consists of two steps: pre-calibration and calibration.
The pre-calibration procedure provides an initial rough estimate of the camera poses. We calibrate the extrinsic matrix between the different Kinect cameras using the ICP algorithm with markers as the calibration object [34,33,17]. We define the depth camera of the first Kinect as the reference.
The second procedure performs a full camera calibration, i.e., it computes the intrinsic and extrinsic parameters by finding numerous 3D point matches between pairs of adjacent cameras. We select a computationally inexpensive solution, the R-near neighbor search [17], which is described below.
First, apply the transformations obtained during the pre-calibration step to each of the point clouds PC_i, i ∈ {2, 3, 4}, to align them with the reference point cloud PC_1 from the first sensor. Once the point clouds are aligned, search for 3D point matches between the reference point cloud and the rest using a nearest neighbor search inside a radius R = 2 millimeters. A point p is an R-near neighbor of a point q if the distance between p and q is at most R. The algorithm either returns an R-near neighbor or concludes that no such point exists. Note that the point clouds must overlap in order to find matches between pairs of cameras.
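The R-near neighbor matching can be sketched as follows (a brute-force Python illustration; a practical implementation would use a spatial index such as a k-d tree, and the function name is an assumption):

```python
import numpy as np

def r_near_matches(reference, aligned, radius=0.002):
    """For each point in `aligned`, find its nearest neighbor in
    `reference` within `radius` (R = 2 mm when coordinates are in
    meters). Returns (aligned_index, reference_index) pairs; points
    with no R-near neighbor are skipped.
    """
    pairs = []
    for i, p in enumerate(aligned):
        d = np.linalg.norm(reference - p, axis=1)  # distances to all reference points
        j = int(np.argmin(d))
        if d[j] <= radius:
            pairs.append((i, j))
    return pairs
```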
The pose is represented by a 4×4 rigid transformation containing the rotation R and translation t that align each camera with the reference. To obtain these transformations, we use the camera space points A_i, B_i, C_i, D_i from each Kinect sensor obtained in the data acquisition step. Setting the first Kinect (i = 1) as the reference, obtaining the pose boils down to finding the best rotations R_i and translations t_i that align the point sets M_i of the Kinect sensors i ∈ {2, 3, 4} to the points of the reference Kinect (M_1).
The calibration of the extrinsic matrix consists of the following steps:
1. Solve for the R_i and t_i that minimize the alignment error between the points M_i of the i-th Kinect and the reference points M_1:
(R_i, t_i) = argmin Σ_j ||(R_i p_j + t_i) − q_j||², p_j ∈ M_i, q_j ∈ M_1.
2. Compute the centroids c_i and c_1 of the point sets M_i and M_1.
3. Move the points to the origin and find the optimal rotation R_i:
H_i = Σ_j (p_j − c_i)(q_j − c_1)^T, [U, S, V] = SVD(H_i), R_i = V U^T,
where H_i is the covariance matrix of the i-th Kinect and SVD denotes the singular value decomposition.
4. Find the translation t_i as t_i = −R_i c_i + c_1.
5. Apply a refining step using the Iterative Closest Point (ICP) algorithm on each aligned point cloud with the reference, to minimize the difference between them.
The aligned point clouds are denoted PC_i, i ∈ {2, 3, 4}. The matching points between the reference Kinect and the rest (obtained in the point cloud matching step) are the 3D points in the world reference, denoted P_Wi for i ∈ {2, 3, 4}; the 2D projections of these points on the image plane are denoted u_i = (u, v) and are known from the acquisition step.
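The SVD-based alignment (a Kabsch-style rigid registration of the marker points) can be sketched in Python as follows (the function name is an assumption; the ICP refinement step is omitted):

```python
import numpy as np

def estimate_pose(points_i, points_ref):
    """Estimate the rotation R_i and translation t_i aligning the
    marker points of Kinect i (points_i, shape (N, 3)) with the
    reference Kinect's points (points_ref), via centroids, the
    covariance matrix, and SVD.
    """
    c_i = points_i.mean(axis=0)
    c_ref = points_ref.mean(axis=0)
    # Covariance of the centered point sets.
    H = (points_i - c_i).T @ (points_ref - c_ref)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:      # guard against reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = c_ref - R @ c_i
    return R, t
```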
In homogeneous coordinates, the mapping between points P_W = (x, y, z) and their 2D projections u = (u, v) in the image plane is given by
s [u, v, 1]^T = K [R, t]_{W→C} [x, y, z, 1]^T,
where K is the matrix of intrinsic (camera) parameters, [R, t]_{W→C} are the extrinsic parameters, R is a 3×3 rotation matrix that defines the camera orientation, and t is a translation vector that describes the position of the camera in the world. Our goal is to compute the intrinsic parameter matrix K, which contains the focal lengths (α, β), a skew factor (γ), and the principal point (u_0, v_0), for the fixed extrinsic parameters [R, t] obtained in the pre-calibration step. We estimate the intrinsic parameters as proposed in [33], where r_jk and t_x, t_y, t_z are the rotation and translation elements of the known pose transformation between the reference camera and the camera whose intrinsic parameters we want to estimate.
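The pinhole projection model above can be sketched in Python (an illustrative version; the function name is an assumption):

```python
import numpy as np

def project(points_world, K, R, t):
    """Project 3D world points onto the image plane via the pinhole
    model: s * [u, v, 1]^T = K ([R | t] [x, y, z, 1]^T).

    K: 3x3 intrinsic matrix [[alpha, gamma, u0],
                             [0,     beta,  v0],
                             [0,     0,     1 ]]
    R, t: extrinsics mapping world to camera coordinates.
    """
    cam = points_world @ R.T + t    # world -> camera coordinates
    uvw = cam @ K.T                 # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3] # perspective division
```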
Step 3: point cloud fusion
Fusing all the point clouds with color into a single cloud can be done using the calibration data from each camera [17]. After acquiring a depth and color frame from each Kinect sensor, we undistort the depth image and obtain the [x, y, z] coordinates of each pixel in the 3D world space. The [x, y, z] points are mapped onto the color frame using the intrinsic and extrinsic parameters of the color camera to obtain the corresponding color of each 3D point. Finally, to merge the colored 3D data, we use the extrinsic parameters of each camera, i.e., the pose between each camera and the reference, to transform all the point clouds into a single reference frame.
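The fusion step can be sketched as follows (a minimal Python illustration, assuming each colored point cloud is an (N, 6) array of [x, y, z, r, g, b] and that the pose (R, t) of each camera relative to the reference is known):

```python
import numpy as np

def fuse_point_clouds(clouds, poses):
    """Merge colored point clouds into the reference frame.

    clouds: list of (N_i, 6) arrays, columns [x, y, z, r, g, b];
            clouds[0] is the reference Kinect.
    poses:  list of (R, t) pairs for clouds[1:], the extrinsics that
            map each camera's points into the reference frame.
    """
    fused = [clouds[0]]
    for cloud, (R, t) in zip(clouds[1:], poses):
        xyz = cloud[:, :3] @ R.T + t   # transform into the reference frame
        fused.append(np.hstack([xyz, cloud[:, 3:]]))
    return np.vstack(fused)
```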

Experimental results
In this section, we present the results to evaluate the performance of our proposed method of calibration and fusion of multiple Kinect sensors for object 3D reconstruction.
In our experiments, we used four Kinect V2 sensors connected to four computers, each with a four-core Intel Core i7 processor and 16 GB of memory. To evaluate the performance of our calibration method, we carried out point cloud fusion and 3D reconstruction of a chair from the dataset [37]. The object was placed in the field of view of the four cameras, and depth maps and RGB frames were acquired by each Kinect V2 sensor. Fig. 1 shows the RGB images and depth maps of the chair taken by the four Kinect sensors on real data.
The Kinect accuracy is not very good and degrades with distance [38]. However, our calibration method with computed distortion parameters yields better accuracy than the Kinect's built-in mapping. To evaluate our calibration results qualitatively, we mapped the [x, y, z] points onto the color frame using the intrinsic and extrinsic parameters of the color camera; in this way, the corresponding color of each 3D point is obtained. Finally, by merging the colored 3D data from the four Kinect sensors, we obtained a fused 3D point cloud, which was then used to reconstruct a meshed object with MeshLab. Fig. 2 shows the 3D reconstruction of the chair; the resulting 3D model is detailed and accurate.
The experiment has shown that the proposed method of calibration and fusion of multiple Kinect sensors is able to provide accurate 3D object reconstruction. All frames from multiple Kinect sensors are fused correctly.
We also present experimental results evaluating the measurement accuracy of the proposed system with fusion of information from RGB-D sensors. The evaluation metric is the root mean square error (RMSE) of the measurements:
RMSE = sqrt((1/N) Σ_{i=1}^{N} (ED_i − RD_i)²),
where ED is the measurement estimated by the device and RD is the real, known measurement of the object. Five measurements were taken and their average values computed. The corresponding RMSE values calculated for Kinect V2 are shown in Table 2. The results show that Kinect V2 yields a sufficiently accurate 3D model of the object; the obtained accuracy allows all measurements to be made on the 3D model as on the real object.
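The RMSE between estimated and real measurements can be computed as in the following sketch (the function name is an assumption):

```python
import numpy as np

def rmse(estimated, real):
    """Root mean square error between device measurements (ED) and
    the known real measurements (RD)."""
    estimated = np.asarray(estimated, float)
    real = np.asarray(real, float)
    return float(np.sqrt(np.mean((estimated - real) ** 2)))
```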

Conclusion
In this paper, we proposed a system that fuses information from multiple Kinect sensors for 3D object reconstruction. The procedure consists of the following steps: data acquisition, calibration of the multiple Kinect sensors, and fusion of the point clouds from the sensors. The implementation was done in MATLAB using Kinect V2 sensors. We evaluated the performance of the proposed system for 3D object reconstruction on real data. The experiments show that the proposed method of calibration and fusion of multiple Kinect sensors provides accurate 3D object reconstruction.