1 Introduction

Publicly available data for algorithm evaluation are instrumental to achieving scientific progress in many disciplines. Providing a common ground for objective and reliable benchmarking improves research transparency and reproducibility [21]. The fields of computer vision and robotics are no exception. One relevant problem in robotics—vision-based mobile robot navigation—is situated at the crossroads of these fields. Moreover, the recent launch of compact, inexpensive RGB-D sensors based on structured light [18] or time-of-flight cameras [19] has provided the robotics community with an attractive sensing solution, enabling the integration of depth and vision data within the robot navigation processing pipeline for increased accuracy.

The goal of vision-based navigation is the recovery of the camera trajectory as a series of sensor poses based on the sensor data, i.e., the visual odometry (VO) problem [13, 28]. Another approach to visual trajectory reconstruction is simultaneous localization and mapping (SLAM) [7, 8]. The VO approach is usually keyframe-based. In most cases, the transformation function for the coordinates of sparse sets of points matched across a number of frames is known [17, 35]. This enables the use of various frameworks for inter-frame transformation error minimization to establish a series of consecutive sensor displacements and poses [34]. The canonical SLAM approach, on the other hand, is based on filtering and aims at accumulating knowledge from past measurements to update the structure and motion information in the form of a probability distribution that is as accurate as possible. The two concepts are combined within the recently popular graph-based SLAM. In this approach, the nodes of the graph correspond to poses of the robot, and the edges of the graph express the spatial constraints between the nodes. The constraints are introduced based on successive measurements. The spatial configuration of the nodes that is maximally consistent with the measurements is found by solving an error minimization problem [15].
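In a generic formulation (implementations, including [15], differ in details), graph-based SLAM seeks the configuration of node poses \(\mathbf {x}\) that minimizes the sum of squared constraint errors:

$$\begin{aligned} \mathbf {x}^{*}=\mathop {\mathrm {argmin}}\limits _{\mathbf {x}}\sum _{(i,j)} \mathbf {e}_{ij}(\mathbf {x}_i,\mathbf {x}_j)^\mathrm{T}\,\mathbf {\Omega }_{ij}\,\mathbf {e}_{ij}(\mathbf {x}_i,\mathbf {x}_j), \end{aligned}$$

where \(\mathbf {e}_{ij}\) is the difference between the relative transformation predicted from the poses of nodes i and j and the corresponding measured constraint, and \(\mathbf {\Omega }_{ij}\) is the information matrix of that constraint.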

In this paper, a new RGB-D data set containing the measurements performed using both the first- and the second-generation Kinect (hereinafter Kinect v1 and Kinect v2, respectively) is described. In the following, we refer to the data set as PUTK\(^2\)—PUT Kinect 1 and Kinect 2 data set. The data set consists of eight sequences registered in an office-like environment. The sequences are composed of color images and depth data frames acquired by both sensors and are supplemented with accurate ground truth, enabling full reconstruction of relative displacements and pose changes across all sensor positions. The ground truth was established using an overhead system of mutually calibrated cameras. Moreover, great care was taken to ensure proper synchronization of the acquisition by both sensors and the overhead camera system, which further contributes to the overall accuracy of the trajectory reconstruction. The availability of precise ground truth data facilitates reliable navigation algorithm benchmarking using both color and depth images. The successful use of typical CCD cameras, instead of a commercial IR-based tracking system such as Vicon [36] or OptiTrack [23], to generate ground truth for a navigation-related data set demonstrates that, with some extra effort on the programming and calibration side, such a data set can be obtained at a fraction of the cost of a commercial motion capture system. Furthermore, the use of both versions of the Kinect sensor enables accurate side-by-side comparisons of the characteristics of both devices.

To demonstrate the usefulness of the acquired data, results of the evaluation of a VO algorithm using the obtained sequences are presented and discussed. The evaluation includes the investigation of the influence of the properties of both Kinect sensors on visual navigation performance. VO was chosen over SLAM for demonstration purposes, as VO uses only the local consistency of the trajectory and the local map (sensor displacement model) to obtain the local trajectory estimate, whereas SLAM aims at assuring the consistency of the global map [37]. By its very nature, VO is therefore burdened by unbounded local drift, but it depends more directly on the quality of the input data. The complete benchmark data are available for download at http://lrm.put.poznan.pl/putkk/. Aside from the images and trajectories, the website contains a detailed description of the data formats and sample videos of the registered sequences.

This paper is organized as follows. Section 2 describes related work—other available vision-based navigation benchmark data and work on the comparison of Kinect sensors. Section 3 gives a brief description of the multi-camera acquisition setup for trajectory reconstruction, its configuration and calibration. The registered trajectories and their properties are presented in Sect. 4. Section 5 presents the VO algorithm that uses the registered data for accuracy benchmarking, and the detailed results of the tests are presented in Sect. 6. Finally, Sect. 7 summarizes the paper and draws conclusions.

2 Related work

The need for a common ground for reliable and objective robot navigation algorithm evaluation has long been acknowledged in the research community. As algorithms improved and more sensors were applied to this task, databases for laser scanner and vision-based navigation emerged [5, 31]. Unfortunately, these early benchmarks do not include the depth data used by many of the state-of-the-art navigation algorithms. Moreover, as shown in [29], the data sets may contain improperly labeled data, and the acquisition process is usually not synchronized, which often limits their usefulness. The introduction of inexpensive RGB-D sensors spurred interest in their applications in computer vision and robotics. This led to the emergence of a range of data sets containing RGB-D data. While most of the publicly available benchmark data are meant for applications such as object segmentation and recognition [25, 30, 38], data sets for navigation evaluation have also been published.

The benchmark first presented in [33] and expanded in [32] contains 39 corresponding RGB and D image sequences gathered using the Asus Xtion (based on Kinect v1) sensor with ground truth registered using a motion capture system. Regrettably, for lengthy parts of the sequences taken by the sensor mounted on the robot, no physical objects are visible within the depth sensor’s working range. Under such conditions, the inter-frame displacement and pose change cannot be established without resorting to a vision-only approach based on RGB images. Moreover, the data from the motion capture system and the data from the Kinect sensor are not perfectly synchronized because of the different sampling frequencies and potentially missing data, so an additional data association and interpolation step is required. A very large-scale data set, containing data acquired along a 42 km indoor trajectory using a stereo camera, laser scanners, an IMU and the Kinect v1 sensor, is described in [11]. The ground truth for the data set was collected by aligning the laser scan results with the highly accurate construction plans of the building. The authors do not, however, describe how the sensor data are synchronized with the acquired ground truth trajectory. As stated in the article, the accuracy of the system is believed to be about 2–3 cm, but no detailed analysis or evaluation of this claim is given. While the system is undoubtedly very useful for the evaluation of long-term autonomy, the solution presented in this paper is a better choice for navigation accuracy testing due to its more precise ground truth and near-perfect synchronization. Another interesting approach to RGB-D-based navigation benchmarking is presented in [16]. The data set consists of images obtained from camera trajectories in ray-traced 3D models. As the data set is fully synthetic, perfect ground truth for trajectories and structure is directly available. However, such benchmarks, although useful, cannot fully replace evaluation in real-life conditions. As shown in [12] and [18], the Kinect sensors exhibit distinctive characteristics, which influence the acquisition process significantly. To the knowledge of the authors, none of the currently available navigation benchmarks contains data registered using both types of Kinect sensors. The articles comparing the accuracy of the two Kinect sensor types focus mostly on applications outside of robotics [1, 39]. The rare exceptions deal with the properties of the sensors themselves, evaluated in a static, laboratory environment [20], or with scenarios other than navigation, such as the recent work [14], which considers the tasks of 3D reconstruction and object recognition when comparing Kinect v2 with RGB-D sensors based on structured light. The quantitative comparison with respect to ground truth obtained with a metrological laser scanner presented in [14] revealed that Kinect v2 produces less error in the mapping between the RGB and depth frames, and the obtained depth values are more stable under distance variations.

3 Vision-based motion registration

3.1 Structure of the multi-camera vision system

The multi-camera vision system used for the registration of the PUTK\(^2\) data set consists of five high-resolution Basler acA1600 cameras equipped with low-distortion, aspherical 3.5-mm lenses. The cameras are installed in an X-like pattern under the ceiling of the laboratory. Due to this particular arrangement, the field of view (FOV) of each of the peripheral cameras partially overlaps with the FOVs of the two neighboring peripheral cameras and the central one. The cameras’ arrangement is shown in Fig. 1, whereas Fig. 2 presents the idea of the overlapping FOVs. As a result, the central part of the room, in which most of the movement occurs, is observed by at least two cameras (usually three). This increases the accuracy of the reconstructed ground truth trajectory and facilitates the calibration of the multi-camera system.

Fig. 1 Arrangement of the cameras in the multi-camera system

Fig. 2 Overlapping FOVs of the multi-camera system

Table 1 The intrinsic parameters of the multi-camera system
Fig. 3 Calibration marker observed simultaneously by three cameras

Each of the cameras was calibrated according to the rational lens model proposed by Claus and Fitzgibbon [6], assuming that the projection of a 3D point \(p=\left[ \begin{array}{ccc}x&y&z \end{array}\right] ^\mathrm{T}\) onto image coordinates \(q=\left[ \begin{array}{cc} u&v \end{array}\right] ^\mathrm{T}\) is calculated as:

$$\begin{aligned}&x'=\frac{x}{z} \end{aligned}$$
(1)
$$\begin{aligned}&y'=\frac{y}{z} \end{aligned}$$
(2)
$$\begin{aligned}&r^2=x'^2+y'^2 \end{aligned}$$
(3)
$$\begin{aligned}&R=\frac{1+k_1r^2+k_2r^4+k_3r^6}{1+k_4r^2+k_5r^4+k_6r^6} \end{aligned}$$
(4)
$$\begin{aligned}&x''=x'R+2p_1x'y'+p_2(r^2+2x'^2) \end{aligned}$$
(5)
$$\begin{aligned}&y''=y'R+p_1(r^2+2y'^2)+2p_2x'y' \end{aligned}$$
(6)
$$\begin{aligned}&u=f_xx''+c_x \end{aligned}$$
(7)
$$\begin{aligned}&v=f_yy''+c_y \end{aligned}$$
(8)

where \(f_x\), \(f_y\), \(c_x\) and \(c_y\) are the focal lengths and the principal point coordinates, while \(k_i\) and \(p_i\) stand for the ith radial and tangential distortion coefficients, respectively. The obtained intrinsic parameters and the corresponding average reprojection errors are given in Table 1.
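For reference, the projection of Eqs. (1)–(8) can be written compactly as the following sketch; it is an illustrative re-implementation (function and argument names are ours), not the calibration code used for the data set:

```python
import numpy as np

def project_point(p, fx, fy, cx, cy, k, p1, p2):
    """Project a 3D point p = (x, y, z) onto image coordinates (u, v)
    using the rational lens model of Eqs. (1)-(8).
    k holds the radial coefficients (k1, ..., k6)."""
    x, y, z = p
    xp, yp = x / z, y / z                                # Eqs. (1)-(2)
    r2 = xp**2 + yp**2                                   # Eq. (3)
    R = (1 + k[0]*r2 + k[1]*r2**2 + k[2]*r2**3) / \
        (1 + k[3]*r2 + k[4]*r2**2 + k[5]*r2**3)          # Eq. (4)
    xpp = xp*R + 2*p1*xp*yp + p2*(r2 + 2*xp**2)          # Eq. (5)
    ypp = yp*R + p1*(r2 + 2*yp**2) + 2*p2*xp*yp          # Eq. (6)
    return fx*xpp + cx, fy*ypp + cy                      # Eqs. (7)-(8)
```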

The precise reconstruction of the robot’s trajectories requires knowing the exact poses of the cameras. The global multi-camera system calibration method proposed by Schmidt et al. [29] was used. Observations of a calibration marker placed in different poses, in which it is visible from at least two cameras (Fig. 3), are used to simultaneously minimize the reprojection error on all the images registered by all the cameras using the Levenberg–Marquardt algorithm.

The model parameter vector \(\beta _\mathrm{MCS}\) consists of the poses of the peripheral cameras and the poses of the calibration marker during consecutive observations:

$$\begin{aligned} \beta _\mathrm{MCS}=\left[ \begin{array}{cccccccccc} r_{P_1}&t_{P_1}&\cdots&r_{P_4}&t_{P_4}&r_{M_1}&t_{M_1}&\cdots&r_{M_N}&t_{M_N} \end{array}\right] \end{aligned}$$
(9)

where \(r_{P_i}\) and \(t_{P_i}\) stand for the orientation (described using Rodrigues’ rotation formula) and the translation vector of the ith peripheral camera w.r.t. the central camera’s coordinate system. Similarly, \(r_{M_i}\) and \(t_{M_i}\) represent the ith pose of the calibration marker. The results of the calibration procedure are presented in Table 2 and in Fig. 4.
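The following sketch illustrates the structure of such a global optimization over the parameter vector of Eq. (9); it is a simplified, illustrative implementation (names, the use of SciPy/OpenCV, and the convention that each \((r, t)\) pair passed to the projection maps points into the respective camera frame are our assumptions), not the actual code of [29]:

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def transform(r, t, pts):
    """Apply the rigid transform given by Rodrigues vector r and translation t."""
    R, _ = cv2.Rodrigues(np.asarray(r, dtype=float).reshape(3, 1))
    return pts @ R.T + np.asarray(t, dtype=float)

def residuals(beta, observations, marker_pts, K, dist, n_cams):
    """Reprojection residuals for all marker observations in all cameras.
    beta = [r_P1, t_P1, ..., r_P4, t_P4, r_M1, t_M1, ..., r_MN, t_MN] (Eq. 9).
    observations: list of (cam_idx, marker_idx, detected_corners[Mx2])."""
    cam_params = beta[:6 * (n_cams - 1)].reshape(-1, 6)     # peripheral cameras
    marker_params = beta[6 * (n_cams - 1):].reshape(-1, 6)  # marker poses
    res = []
    for cam, marker, corners in observations:
        r_m, t_m = marker_params[marker, :3], marker_params[marker, 3:]
        pts_c = transform(r_m, t_m, marker_pts)             # marker -> central frame
        if cam == 0:                                        # central camera
            r_c, t_c = np.zeros(3), np.zeros(3)
        else:                                               # peripheral camera
            r_c, t_c = cam_params[cam - 1, :3], cam_params[cam - 1, 3:]
        proj, _ = cv2.projectPoints(pts_c, r_c, t_c, K[cam], dist[cam])
        res.append((proj.reshape(-1, 2) - corners).ravel())
    return np.concatenate(res)

# Usage sketch: beta0 is an initial guess, e.g., from pairwise PnP solutions.
# result = least_squares(residuals, beta0, method='lm',
#                        args=(observations, marker_pts, K, dist, 5))
```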

Table 2 The rotation (\(r_x\), \(r_y\), \(r_z\)) and translation (\(t_x\), \(t_y\), \(t_z\)) vectors of the peripheral cameras with respect to the Marker’s coordinate system
Fig. 4 Estimated poses of the peripheral cameras (P1—green, P2—red, P3—cyan and P4—magenta) and poses of the calibration marker (black) with regard to the central camera coordinate system (blue) (color figure online)

3.2 Experimental setup with the Kinect sensors

The mobile robot used in the experiments was equipped with two sensors: Kinect v1 and Kinect v2 facing forward, and a firmly attached chessboard marker facing up. The marker was used to calculate the ground truth trajectory of the robot.

As the visual navigation algorithms track the pose of the moving camera, it was necessary to estimate the transformations between the robot’s sensors and the chessboard marker. The algorithm described in [29] was used for that purpose. The method uses images of an external calibration marker observed by the robot’s sensors and images registered by an external camera observing both the external and the robot’s marker (Fig. 5).

Fig. 5 Calibration of the Kinect sensors w.r.t. the Marker’s coordinate system

Before the procedure, both Kinect sensors (\(K_1\) and \(K_2\)) and the external camera (E) were calibrated using the rational model. Afterward, the poses of the Kinect sensors w.r.t. the Robot Marker were determined by simultaneous minimization of the reprojection error of the chessboard patterns’ (both the external and the robot’s) corners on all the images registered by the Kinect sensors and the external camera. Figure 6 shows the visualization of the estimated sensors’ positions, while Table 3 contains the numerical data.

Fig. 6 Estimated poses of the two Kinect sensors (blue), the external camera (red) and the external marker (green) w.r.t. Robot Marker’s coordinates (black) (color figure online)

Table 3 Rotation (\(r_x\), \(r_y\), \(r_z\)) and translation (\(t_x\), \(t_y\), \(t_z\)) vectors of the Kinect sensors
Fig. 7 Functional schematics of the registration system

The registration system schematic is presented in Fig. 7. It consisted of five overhead cameras, two Kinect sensors, three computers and some additional network equipment. Such a distributed solution was necessary, as the stream of data from the sensors was too intensive for a single device to handle. Thus, one of the computers was responsible for capturing the video streams from the overhead cameras. Dedicated network adapters were used to facilitate this task, providing a separate connection for each camera. The same machine was also used as a server that synchronized the data acquisition procedure. Two low-profile notebooks were mounted on the robot for grabbing frames from both Kinect sensors. To provide the throughput necessary to save all the acquired data, each computer was equipped with a fast SSD disk.

The registration system elements were connected by Ethernet cables, and TCP/IP-based communication was used. This solution was preferred over a wireless one, using either WiFi or proprietary modules, as it offered the lowest and, more importantly, the most stable delay. The delay was monitored during the whole experiment and remained below 1 ms.

The server waits for all the elements of the system to reach the ready state and then sends the trigger signal. This way, all the acquired images are synchronized; however, the maximum acquisition frequency was only about 11 frames per second, as the system works only as fast as its slowest component.
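The principle of this barrier-style triggering can be illustrated by the following minimal sketch; the port number, message strings and single-shot structure are illustrative assumptions (in the real system, the handshake is repeated for every acquired frame):

```python
import socket

# Illustrative constants: overhead-camera PC plus two Kinect notebooks.
HOST, PORT, N_CLIENTS = "0.0.0.0", 5000, 3

def run_trigger_server():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((HOST, PORT))
    server.listen(N_CLIENTS)
    clients = []
    # Wait until every acquisition node reports that it is ready.
    while len(clients) < N_CLIENTS:
        conn, _ = server.accept()
        if conn.recv(16) == b"READY":
            clients.append(conn)
    # All nodes are ready: broadcast the trigger so each node grabs its frame.
    for conn in clients:
        conn.sendall(b"TRIGGER")
    for conn in clients:
        conn.close()
    server.close()

if __name__ == "__main__":
    run_trigger_server()
```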

4 Gathering the RGB-D data

The robotic setup used for our experiments with Kinect v1 and Kinect v2 sensors and vision-based ground truth system is presented in Fig. 8.

Fig. 8 Coordinate systems of Kinect v1 (\(K_1\)), Kinect v2 (\(K_2\)), ceiling cameras/global (G) and the pattern on a mobile robot (P)

In this setup, the ceiling-mounted cameras provide information about the motion of the image calibration pattern attached to the robot (coordinate system P) in the coordinate system of the ceiling cameras, considered as the global coordinate system (G), which can be written as \({}^{G}\mathbf {P}\). Transformations between the coordinate systems of Kinects (\({K}_1\) and \({K}_2\)) and the coordinate system of the pattern mounted on the robot P were found prior to operation with additional external calibration (Sect. 3.2) and are denoted as \({}^{P}\mathbf {K}_1\) and \({}^{P}\mathbf {K}_2\) for Kinect v1 and Kinect v2, respectively. To ease the use of the data, we provide the ground truth trajectory of the Kinect v1 and the Kinect v2 with respect to the global coordinate system G, which was computed with the following equation:

$$\begin{aligned} {}^{G}\mathbf {P}{}^{P}\mathbf {K}_i = {}^{G}\mathbf {K}_i, \end{aligned}$$
(10)

where \({}^{G}\mathbf {K}_i\) is the current Kinect ground truth pose estimate in the global coordinate system G and i stands for 1 or 2 for the Kinect v1 or the Kinect v2, respectively. Finally, ground truth sensor trajectories for each of the Kinects and for each experiment in the data set are made publicly available, accompanying the respective sequence of RGB and depth images.
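In terms of homogeneous \(4\times 4\) matrices, Eq. (10) is a single matrix product; the short sketch below (with illustrative names) shows how a user of the data set can obtain \({}^{G}\mathbf {K}_i\) from the per-frame marker pose and the constant marker-to-Kinect calibration:

```python
import numpy as np

def pose_matrix(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation matrix and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def kinect_ground_truth(G_P, P_K):
    """Eq. (10): {}^G K_i = {}^G P {}^P K_i, with both inputs as 4x4 matrices."""
    return G_P @ P_K

# Usage sketch: G_P is the marker pose in the global frame for one synchronized
# frame, P_K is the constant marker-to-Kinect transform obtained in Sect. 3.2.
```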

Eight robot trajectories were registered during the experiments. The first four of them simulated an indoor exploration scenario with its characteristic features:

  • long, approximately straight sections,

  • sudden turns,

  • multiple loops,

  • moving backwards.

The main purpose of these trajectories is to provide benchmarking material for the comparison of visual navigation algorithms using either the Kinect v1 or the Kinect v2 sensor in typical conditions. Figure 9 contains the reconstructions of these trajectories.

Fig. 9 The first four trajectories of the robot w.r.t. coordinate system of the central camera

During the remaining four trajectories, the robot moved approximately along the same path with speeds varying between the trials. The purpose of this part of the data set is to provide data suitable for evaluating both sensors’ robustness to different motion speeds. The overlaid reconstructed trajectories are presented in Fig. 10.

The trajectories were reconstructed offline. The consecutive poses of the robot were calculated independently according to the observations from the motion registration system. Due to the overlapping FOVs of the cameras (cf. Fig. 2), the robot was usually observed by more than one camera. The Levenberg–Marquardt algorithm was used to find the robot pose in the coordinate system of the central camera by minimizing the average reprojection error of the chessboard marker corners on all the images captured for this pose. Table 4 contains the number of frames, the length and the average robot velocity for the registered trajectories. All the RGB and depth images are stored in the lossless PNG format to ensure that image compression will not influence the results of any evaluation using the images as source data. The depth data are stored in 16-bit grayscale images, in which one unit corresponds to a depth of 1 mm. The supplementary ground truth data (position and orientation in quaternion format) have the same structure and format as the ones provided by the data set described in [32]. Camera calibration data are also included, along with sample code snippets for Kinect image distortion correction.
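The sketch below illustrates how the released files can be read; the file names are placeholders, and the conversion assumes only what is stated above (16-bit PNG depth with 1 mm units and TUM-style pose lines [32]):

```python
import cv2
import numpy as np

# Load one depth frame (file name is illustrative); 16-bit PNG, one unit = 1 mm.
depth_raw = cv2.imread("depth_000001.png", cv2.IMREAD_UNCHANGED)
depth_m = depth_raw.astype(np.float32) / 1000.0   # convert to meters

# Ground truth lines follow the TUM RGB-D convention [32]:
# timestamp tx ty tz qx qy qz qw
with open("groundtruth.txt") as f:
    poses = [tuple(map(float, line.split()))
             for line in f if line.strip() and not line.startswith("#")]
```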

Fig. 10 The last four trajectories of the robot

Table 4 The parameters of the registered trajectories

5 RGB-D visual odometry as the test application

5.1 The PUT RGB-D visual odometry system

A VO algorithm makes it possible to determine the motion of the sensor using only a sequence of images, without creating a map of the environment [28]. A VO algorithm can also serve as a base for a solution to the SLAM problem formulated as pose-based SLAM using graph optimization [4]. The use of RGB-D frames that contain direct depth measurements enables the use of 3D-to-3D feature correspondences for frame-to-frame motion estimation, instead of the 2D-to-2D correspondences in monocular VO [13].

The goal is to estimate the motion between the first frame \(I_{v(k)}\) and the last frame \(I_{v(k+n)}\) in a sequence of n RGB-D frames. This can be accomplished by matching the salient visual features from the first and the last frame, followed by the estimation of the geometric transformation between two sets of 3D points. However, the matching-based approach commonly used in RGB-D VO and SLAM is computationally demanding and often requires hardware acceleration [9]. Therefore, we have proposed an alternative approach, based on sparse optical flow tracking of the visual features [22]. With this approach, we detect the point features in the \(I_{v(k)}\) frame and then track these points through the n images. The tracked features define the correspondences between \(I_{v(k)}\) and \(I_{v(k+n)}\). Then, the depth data are associated with the visual key points resulting in two sets of matched 3D points. The structure of the used VO pipeline is presented in Fig. 11.

Fig. 11 Block scheme of the PUT RGB-D VO system

The data processing starts with the detection of a set of point features (key points) that should be localized precisely in the image and be repeatable in a sequence of images showing the same scene. In our previous work [4], we investigated three point feature detectors: FAST [26], ORB [27] and SURF [3], and we found the ORB features to be the most suitable, due to their multi-scale corner-like detector, which yields highly repeatable features at a reasonable computational cost.

To make the whole feature extraction process more robust, we detect the features in sub-images and employ clustering of the resulting key points. The detection of features in separate, slightly overlapping sub-images helps to distribute the key points evenly over the image. The RGB image is divided into 16 equal, square sub-images, and the ORB detection is performed individually in each window, with the detector parameters adapted to ensure a similar number of features in each square. After feature detection, DBScan [10]—a fast, unsupervised clustering algorithm—is used to detect groups of key points, and then the strongest key point from each group is selected. This way, nearby features detected in a small area of the image are represented by a single key point, which helps to avoid the aliasing of key points that results in false feature correspondences.
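The grid-based detection and DBScan-based thinning can be sketched as follows; the parameter values and the use of OpenCV and scikit-learn are illustrative assumptions, and the slight overlap of the sub-images used in our implementation is omitted for brevity:

```python
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def detect_keypoints(gray, grid=4, per_cell=40, eps=5.0):
    """Grid-based ORB detection with DBSCAN thinning of nearby key points.
    grid=4 gives 16 sub-images; per_cell and eps (pixels) are illustrative."""
    h, w = gray.shape
    orb = cv2.ORB_create(nfeatures=per_cell)
    keypoints = []
    for gy in range(grid):
        for gx in range(grid):
            x0, y0 = gx * w // grid, gy * h // grid
            roi = gray[y0:y0 + h // grid, x0:x0 + w // grid]
            for kp in orb.detect(roi, None):
                # Shift key point coordinates back to the full-image frame.
                keypoints.append(cv2.KeyPoint(kp.pt[0] + x0, kp.pt[1] + y0,
                                              kp.size, kp.angle, kp.response,
                                              kp.octave, kp.class_id))
    if not keypoints:
        return []
    # Cluster nearby detections and keep only the strongest key point per cluster.
    pts = np.array([kp.pt for kp in keypoints])
    labels = DBSCAN(eps=eps, min_samples=1).fit(pts).labels_
    best = {}
    for kp, lab in zip(keypoints, labels):
        if lab not in best or kp.response > best[lab].response:
            best[lab] = kp
    return list(best.values())
```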

The core part of the VO pipeline is motion estimation based on two sets of corresponding 3D point features, whose correspondences are determined by sparse optical flow tracking. The VO tracks features over a sequence of RGB images between the two key frames that are processed together with depth images. Key points are detected at the key frame, and then the positions of these points in each new image of the sequence are determined by local search. To accomplish this, the pyramidal implementation of the Lucas–Kanade algorithm [2] is applied. This algorithm is initialized with the key points from the ORB detector. If the number of successfully tracked features falls below a given threshold, new key points are detected in the current image and inserted into the pool of tracked features. When the maximum number n of RGB frames in a sequence is reached (the default value is \(n=4\)), the rigid transformation between the key frames is computed. The transformation estimation procedure is embedded in the RANSAC scheme to make it robust to outliers resulting from imperfect tracking. In every iteration, RANSAC randomly selects three pairs of points from the set of tracked key points and estimates the candidate transformation using the Umeyama algorithm [35]. A modern variant of the RANSAC algorithm is used, as in [24], which estimates the number of necessary iterations and allows an iterative correction of the final transformation by rejecting the inlier pairs that are the least probable within the estimated model.
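A simplified sketch of this tracking and RANSAC-based motion estimation step is given below; the window size, iteration count and inlier threshold are illustrative, and the adaptive iteration count and iterative inlier rejection of [24] are replaced by a fixed-iteration loop for brevity:

```python
import cv2
import numpy as np

def track_features(img_prev, img_next, pts_prev):
    """Track key points with the pyramidal Lucas-Kanade algorithm [2]."""
    p0 = pts_prev.reshape(-1, 1, 2).astype(np.float32)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(img_prev, img_next, p0, None,
                                             winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    return p0[ok].reshape(-1, 2), p1[ok].reshape(-1, 2)

def rigid_transform(A, B):
    """Umeyama/Kabsch-style least-squares rigid transform (R, t) mapping A onto B."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    U, _, Vt = np.linalg.svd((A - ca).T @ (B - cb))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cb - R @ ca

def ransac_motion(P, Q, iters=200, thresh=0.02):
    """Estimate the inter-key-frame motion from matched 3D points P -> Q with RANSAC.
    Iteration count and inlier threshold (meters) are illustrative values."""
    best_inl = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = np.random.choice(len(P), 3, replace=False)
        R, t = rigid_transform(P[idx], Q[idx])
        inl = np.linalg.norm(P @ R.T + t - Q, axis=1) < thresh
        if inl.sum() > best_inl.sum():
            best_inl = inl
    if best_inl.sum() < 3:
        raise ValueError("RANSAC failed to find a valid motion model")
    # Refine the transformation on all inliers of the best candidate model.
    return rigid_transform(P[best_inl], Q[best_inl])
```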

5.2 Evaluating the PUT RGB-D VO on the PUTK\(^2\) data set

The PUT RGB-D VO system has already been extensively tested on publicly available data sets, and the results were published in [4]. The data sets used so far were the TUM RGB-D benchmark [32] and the ICL-NUIM data set [16]. The evaluation revealed that the tracking-based VO is fast and accurate, but only if it is fed with good-quality images at a high frame rate [4]. Therefore, we consider the PUT RGB-D VO system a good candidate to evaluate our new Kinect v1 and Kinect v2 data set, as the varying performance of the VO should demonstrate the importance of factors such as image resolution, sensor field of view, the presence of motion blur and depth artifacts.

The operation of the PUT RGB-D VO system on the RGB and depth images from Kinect v1 and Kinect v2 is essentially the same, but some minor modifications were necessary to accommodate the Kinect v2 data. The main reason for these changes is the different resolution of the images provided by the two sensors. The Kinect v1 yields RGB images of size \(640 \times 480\) and depth images of size \(320 \times 240\) that are rescaled to match the RGB images. The new Kinect v2, however, produces RGB images of size \(1920 \times 1080\) and depth images of size \(512\times 424\) that again are rescaled to match the color images. Due to the significant difference in image sizes, all parameters and thresholds that are defined in pixel values needed to be adjusted accordingly. An example of such a parameter is the maximum distance between points belonging to the same cluster in the DBScan algorithm. However, we decided to still divide the RGB frames into the same number of sub-images in both cases. During the tests, the images from both sensors were processed in the VO pipeline but not stored in memory, as due to the size of the Kinect v2 images they would quickly fill the whole RAM of a typical PC.

We demonstrate the trajectory reconstruction accuracy achieved on the Kinect v1 and Kinect v2 sequences. The quantitative results are obtained with respect to the ground truth trajectories, applying the relative pose error (RPE) and absolute trajectory error (ATE) metrics introduced in [32]. The RPE is well suited for measuring the drift of a VO system, like the one we apply in our evaluation of the Kinect v1 and Kinect v2 data. The ATE metric is more appropriate for full SLAM systems that are able to correct the drift of the estimated trajectory [4]. Although for a VO system the absolute trajectory errors grow with time due to the unavoidable drift, we use this metric to demonstrate visually how far the estimated trajectory is from the ground truth one.

The RPE corresponds to the local drift of the estimated trajectory. To compute the RPE, the relative transformation between the neighboring points of the ground truth trajectory \(\mathbf{T}^\mathrm{GT} = \mathbf{T}^\mathrm{GT}_1,\ldots ,\mathbf{T}^\mathrm{GT}_k\) and the estimated trajectory \(\mathbf{T}^\mathrm{E} = \mathbf{T}^\mathrm{E}_1,\ldots ,\mathbf{T}^\mathrm{E}_k\) is computed, and the relative error at time stamp i is given by:

$$\begin{aligned} \mathbf{E}_i = \left( (\mathbf{T}^\mathrm{GT}_i)^{-1}{} \mathbf{T}^\mathrm{GT}_{i+n}\right) ^{-1}\left( (\mathbf{T}^\mathrm{E}_i)^{-1}{} \mathbf{T}^\mathrm{E}_{i+n}\right) , \end{aligned}$$
(11)

where n matches the length of the sequence of RGB-D frames tracked by the VO system (cf. Sect. 5.1). Taking the translational or rotational part of \(\mathbf{E}_i\), we obtain the translational or rotational RPE, respectively.
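A minimal sketch of the translational RPE computation of Eq. (11), assuming both trajectories are given as lists of \(4\times 4\) pose matrices, is:

```python
import numpy as np

def relative_pose_errors(T_gt, T_est, n=4):
    """Translational RPE (Eq. 11) for trajectories given as lists of 4x4 poses.
    n matches the key-frame spacing of the VO system (default as in the paper)."""
    errors = []
    for i in range(len(T_gt) - n):
        delta_gt = np.linalg.inv(T_gt[i]) @ T_gt[i + n]
        delta_est = np.linalg.inv(T_est[i]) @ T_est[i + n]
        E = np.linalg.inv(delta_gt) @ delta_est
        errors.append(np.linalg.norm(E[:3, 3]))   # translational part of E_i
    return np.array(errors)

# RMSE of the translational RPE:
# rpe_rmse = np.sqrt(np.mean(relative_pose_errors(T_gt, T_est) ** 2))
```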

The ATE is based on the Euclidean distances between the estimated trajectory \(\mathbf{T}^\mathrm{E}\) and the ground truth trajectory \(\mathbf{T}^\mathrm{GT}\). First, we map the estimated trajectory onto the ground truth trajectory by computing the transformation \(\mathbf{T}^\mathrm{S}\) that is the least-squares solution to the alignment problem [32]. Then, the error is computed as:

$$\begin{aligned} \mathbf{F}_i = (\mathbf{T}^\mathrm{GT}_i)^{-1}{} \mathbf{T}^\mathrm{S}{} \mathbf{T}^\mathrm{E}_i, \end{aligned}$$
(12)

for each ith trajectory node (time stamp). We extract the translational component of \(\mathbf{F}_i\) and compute the root-mean-square error (RMSE), along with the standard deviation over all time indices. We use the evaluation tools provided with the TUM RGB-D benchmark [32] to compute the RPE and ATE metrics in our experiments.
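For illustration, the ATE of Eq. (12) can be sketched as follows (a simplified rotation-plus-translation alignment; in our experiments the TUM evaluation tools [32] were used instead):

```python
import numpy as np

def absolute_trajectory_error(T_gt, T_est):
    """ATE (Eq. 12): align the estimated trajectory to the ground truth with a
    least-squares rigid transform T_S, then take per-pose translational errors."""
    P = np.array([T[:3, 3] for T in T_est])   # estimated positions
    Q = np.array([T[:3, 3] for T in T_gt])    # ground truth positions
    # Least-squares alignment (Horn/Umeyama, rotation and translation only).
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    errors = np.linalg.norm(P @ R.T + t - Q, axis=1)  # translational part of F_i
    return np.sqrt(np.mean(errors ** 2)), errors.std()
```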

6 Evaluation results

All experiments with the VO system were performed on a desktop PC with an Intel Core i7-2600 3.4 GHz CPU and 16 GB RAM. The VO uses only a single core of the processor. On the tested sequences, our VO pipeline was running at about 50 frames per second (fps) for the Kinect v1 and at about 10 fps for the Kinect v2 data. As the data sets were recorded at 11 fps, the VO performance can be considered real time for both sensors. The processing times in the VO pipeline are as follows. The feature detection and management time in a single RGB frame, averaged over the tested sequences, was 4.25 ms for Kinect v1 and 10.77 ms for Kinect v2. However, considering that detection of the point features is performed only when the number of successfully tracked features falls below a threshold, the average detection time per frame was 1.63 ms for the Kinect v1 and 1.31 ms for the Kinect v2. The average tracking time between two consecutive RGB-D key frames in a sequence was 3.21 ms for the Kinect v1 and 18.11 ms for the Kinect v2. The transformation estimation (RANSAC) took less than 1 ms for both Kinects—this was possible because only a few iterations of RANSAC were necessary to find an acceptable transformation model, which is attributed to the good feature associations maintained by the Lucas–Kanade tracker. In all tests, the maximum number of RGB images for Lucas–Kanade tracking was set to \(n=4\), and the maximum number of tracked features was 500.

Fig. 12 Comparison of the trajectory drift for the Kinect v1 and Kinect v2 data with the positional RPE metric on four different sequences: no. 1 (a), no. 2 (b), no. 3 (c) and no. 4 (d)

The impact of the used RGB-D data on the quality of the estimated trajectories has been investigated using the VO system, to avoid a situation in which the results are altered by trajectory optimization in a SLAM back end. The RPE computed every n frames using (11) reveals the relative errors in translation between the successive RGB-D key frames of the VO system. Figure 12 shows the relative translational errors for the VO system tested on the four sequences that resemble indoor exploration trajectories of a mobile robot. The qualitative difference between the results obtained using the data from Kinect v1 and Kinect v2 is clearly visible on all these plots. The trajectories recovered from the Kinect v2 data tend to have smaller translational RPE values throughout the whole sequence. For the data from both sensors, the RPE occasionally peaks to values an order of magnitude larger than the average. These peaks coincide in time with the sharper turns of the ground truth trajectories. Apparently, when the robot makes a turn, the number of point features that can be tracked over several frames decreases, contributing to a larger error in the computed transformation between the key frames. However, the RPE peaks for the Kinect v1 and Kinect v2 data rarely appear at exactly the same moments along the trajectories, which suggests that the point features found by the VO system in the images from the two sensors are considerably different.

Fig. 13 Estimated and ground truth trajectories with the absolute trajectory errors (ATE) for the sequences no. 1 (a, b) and no. 2 (c, d), using the Kinect v1 (a, c) and Kinect v2 (b, d) data

The ATE results, plotted in Fig. 13 for two example sequences, confirm our observations as to the better accuracy of the trajectories recovered from Kinect v2 data. The drift is apparent in all the tested sequences, but it is much more pronounced in the trajectories obtained using Kinect v1 data (Fig. 13a, c).

Fig. 14 Comparison of the trajectory estimation accuracy for the first four sequences simulating the indoor exploration scenario: translational RPE (a), rotational RPE (b), and ATE (c)

The differences in the accuracy of the recovered trajectories are clearly demonstrated by the statistical data in Fig. 14. For all four sequences, the trajectories estimated using Kinect v2 data have smaller RPE RMSE values, both translational and rotational, as well as smaller ATE RMSE values. In particular, the differences in the rotational relative errors are significant (Fig. 14b). This suggests that the wider horizontal field of view of the Kinect v2 sensor makes it possible to obtain more features that are common between neighboring frames when the robot is turning, whereas in the Kinect v1 data the number of point features shared between the frames considered by the VO is often very small during sharp turns.

Although for the navigation task we lack such precise ground truth as used, for example, in [14] to obtain statistics for depth data on controlled scenes, careful inspection of the recorded frames reveals that there are important differences in the quality of the depth data from the two sensors. Examples are shown in Fig. 15. The Kinect v1 depth frame (Fig. 15b) contains many areas of missing range information (shown in black), which results from the nature of the pattern of “speckles”—points projected onto the scene by the IR emitter of the sensor—and from the vulnerability to excessive ambient light [18]. The Kinect v2 depth frame of the same scene (Fig. 15d) has much fewer areas of missing data; in particular, there are no missing data neighboring the edges of objects, which is the case for Kinect v1. The missing depth data in Kinect v2 are rather isolated (Fig. 15h), whereas in Kinect v1 the missing range measurements form relatively large areas (Fig. 15f), preventing the VO from finding valid 3D features in a bigger part of the frame. These differences are presumably related to the fact that Kinect v1 uses correlation to compare the observed pattern with a reference pattern, which creates dependencies between pixels in the depth image, while in Kinect v2 the depth is measured independently for each pixel [14]. The quality of the RGB frames is similar for both sensors, but the Kinect v2 has a higher-resolution RGB camera and a larger horizontal field of view, and thus an image taken from the same position as for Kinect v1 captures a slightly larger part of the scene.

Fig. 15 Example RGB and depth frames from the trajectory no. 1 illustrating differences between the Kinect v1 (a, b, e, f) and Kinect v2 (c, d, g, h)

Table 5 ATE RMSE and RPE RMSE for the RGB-D VO system measured on all sequences
Fig. 16 Positional RPE showing the difference in drift on the slowest motion sequence no. 5 (a, b), and the fastest motion sequence no. 8 (c, d) for the Kinect v1 (a, c) and Kinect v2 (b, d) data

Fig. 17 ATE plots computed on the sequences no. 5 (a, b) and no. 8 (c, d), for the Kinect v1 (a, c) and Kinect v2 (b, d) data

The trajectories recovered by our VO system from the remaining four sequences, in which the robot moved with different speeds along approximately the same path, demonstrate that both sensors are robust w.r.t. increased motion speed, showing no significant motion blur in the RGB and depth data. Example translational RPE results are shown in Fig. 16. The translational relative errors have similar values for both sensors and different speeds. For the Kinect v2 data, the peak errors are slightly larger. This is attributed to the increased possibility of motion blur in the higher-resolution RGB images from the Kinect v2 sensor. However, the averaged translational and rotational RPE is still slightly higher for the Kinect v1 data for all motion speeds, as shown in Table 5, which summarizes the results for all eight sequences. The ATE results depicted in Fig. 17 show that the absolute trajectory errors are smaller for the sequence with the larger speed. Since the localization error in VO is cumulative, integration of individual frame-to-frame transformation errors across a larger number of frames results in a larger overall error. Trajectories 5–8 have similar lengths, so the trajectories with lower velocity are composed of more frames, and the accumulated error is therefore larger.

The Kinect v2 provides higher-resolution images, and the depth data yielded by this sensor usually contain fewer areas of missing depth than the data obtained from Kinect v1. This results in a higher number of useful point features that are found in the RGB data and have associated valid depth values. Therefore, the number of correct matches (i.e., RANSAC inliers) between the sets of features in the neighboring key frames is usually higher for the Kinect v2 frames. The importance of the increased number of point features is particularly visible when the speed of the sensor increases (trajectories 5–8). When the sensor moves faster, the distances between the key frames are larger, and thus it is more difficult to track the key points using the Lucas–Kanade algorithm. Detecting more features at each key frame helps to ensure that the number of key points that survive tracking until the next key frame will be large enough to compute a correct transformation between the key frames.

7 Conclusions

The article introduces the PUTK\(^2\) RGB-D data set for the evaluation of robot navigation algorithms. The RGB-D data were registered by the Kinect v1 and Kinect v2 sensors moving along eight different trajectories in an indoor environment. The sensor data consist of high-quality, uncompressed RGB and depth images and are supplemented with high-quality ground truth. The ground truth data were generated using a multi-camera system for sensor pose registration, with additional emphasis on careful synchronization during the acquisition process. As a result, the data set provides millimeter-accurate position data and facilitates full pose recovery. To the best knowledge of the authors, the presented data set is the first one to include data registered by both Kinect sensor types in a side-by-side setup with such reliable ground truth. The registration environment was specially arranged to ensure that physical objects are always visible in the working range of the Kinect sensors. Possible applications of the data set include vision- and depth-based navigation evaluation, 3D reconstruction and sensor characteristics comparison. Its usefulness is demonstrated in the paper with an example VO algorithm evaluation.

The evaluation of our RGB-D VO algorithm on the new data set allowed us to compare both versions of the Kinect sensor in the context of indoor navigation. As the tested VO system is feature-based, the most important differences between the Kinect v1 and Kinect v2 concern the number of features that can be extracted from a frame and tracked reliably over several frames. This number is higher for the Kinect v2 due to the higher resolution of the RGB and depth images and the smaller number of missing-data areas in the depth frames of Kinect v2. Therefore, the Kinect v2 outperformed its predecessor on all the tested sequences with respect to both the relative pose errors and the drift of the whole trajectory. Moreover, the wider horizontal field of view of the Kinect v2 contributes to smaller relative pose errors when the sensor makes turns. The much higher resolution of the RGB images makes the processing of Kinect v2 RGB-D data more computationally demanding. However, the similar average feature detection time per frame achieved by our VO on both the Kinect v1 and Kinect v2 data suggests that the point features are tracked more reliably, over a higher number of RGB images, in the Kinect v2 data, as the Lucas–Kanade tracker benefits from the higher resolution of the images.

The presented PUTK\(^2\) data set is publicly available and can be freely downloaded and distributed, providing common ground for research and evaluation of both contemporary and future visual navigation algorithms.