GPU based real-time SLAM of six-legged robot

https://doi.org/10.1016/j.micpro.2015.10.008

Abstract

The fusion of vision and AHRS (attitude and heading reference system) sensors has become a prevalent strategy for legged-robot SLAM (simultaneous localization and mapping) in recent years, owing to its low cost and its effectiveness for global positioning. In this paper, a new adaptive estimation algorithm is proposed to achieve robot SLAM by fusing binocular vision and AHRS sensors. A novel acceleration algorithm for SIFT based on the Compute Unified Device Architecture (CUDA) is presented to detect matching feature points in 2D images. All steps of SIFT are distributed to the CPU or the GPU according to each step's characteristics, so as to make full use of the available computational resources. Registration of the 3D feature point clouds is performed with the iterative closest point (ICP) algorithm. In our tests, the GPU-based SIFT implementation runs at 30 frames per second (fps) on most images of 900 × 750 resolution. Compared with other methods, our algorithm is simple to implement and well suited to parallel processing. It can easily be integrated into mobile-robot tasks such as navigation or object tracking that need real-time localization information. Experimental results show that the proposed algorithm operates stably and achieves high positioning accuracy in unknown indoor environments.

Introduction

The ability to work in an unexplored environment is crucial to the success of autonomous robot operation. Simultaneous localization and mapping (SLAM) is a technique used to build a map of the robot's surroundings while keeping track of its position. In recent years, significant attention has been paid to the development of vision-based SLAM [1], [2], [3]. The reasons behind this popularity are the low price and low power dissipation of cameras, the large amount of information contained in images, and the relatively simple mathematical models of cameras.

Most traditional mobile robots applied sonar [4] or laser ranging [5] to achieve SLAM, but the low resolution of these distance sensors and the high uncertainty of the observed data in complex environments make it difficult to obtain the desired results. A monocular vision sensor is usually combined with sonar [6] or laser radar sensors [7] to achieve SLAM: since a single camera cannot obtain depth information from a single image directly, the camera parameters must be calibrated and the three-dimensional coordinates of the feature points obtained by an additional method, so the positioning accuracy is difficult to ensure.

Feature-based simultaneous localization and mapping (SLAM) [8] has a long history. Modern approaches aim at matching all pixels, which requires graphics-card implementations due to the high computational load, and partly rely on active sensors [9], [10]. The scale-invariant feature transform (SIFT) [11] is a computer vision algorithm for detecting and describing local features in images, but its implementation is complicated and time-consuming. Recently, remarkable progress has been made in accelerating the SIFT algorithm on GPUs and multicore CPUs [12]. Clipp et al. [13] presented a system for real-time, six-degree-of-freedom visual simultaneous localization and mapping using a stereo camera as the only sensor; the system makes extensive use of parallelism, both on the graphics processor and through multiple CPU threads. Warn [14] implemented a parallel SIFT algorithm with OpenMP and also implemented the Gaussian convolution using CUDA on the GPU. Sequential implementations of SIFT are known to have high execution times: the open-source sequential implementation SIFT++ [15] takes around 3.3 s on a 2.4 GHz processor, i.e. 0.31 fps for a 640 × 480 image. Sinha et al. [16] used OpenGL/Cg for their implementation on an NVIDIA GeForce 7900 GTX card and reported a speed of 10 fps for 640 × 480 video. Heymann [17] also implemented SIFT using OpenGL on the GPU, organizing adjacent gray-level pixels as the RGBA channels of a vector, which improved the execution speed to 17.24 fps for 640 × 480 video. Aniruddha et al. [18] presented a parallel implementation of SIFT on a GPU that obtains a speed of around 55 fps for a 640 × 480 image.

Binocular-vision-based real-time SLAM [19] uses only a pair of cameras placed in parallel to sense the surroundings. The internal and external parameters of the cameras are obtained after calibration, and the three-dimensional coordinates of the feature points can be obtained directly from the left and right images by a stereo matching algorithm. However, this method requires recovering the robot motion from the image information alone, and robust motion estimation is difficult to achieve [20]: visual pose estimation can fail suddenly due to bad lighting, poor texture in the scene, or fast movements that cause image blur. Supporting the camera measurements with a proprioceptive sensor such as an AHRS should close the gaps where image-based motion estimation fails and also allow an increased update rate of the motion estimation. This paper therefore proposes a method that combines an AHRS with binocular vision to achieve SLAM for the robot. It operates in real time at more than 30 frames per second by combining data-parallel algorithms on the GPU, parallel execution of compute-intensive operations, and producer/consumer thread relationships that effectively use modern multi-core CPU architectures. The information [21] extracted by the vision sensor and the AHRS is integrated to achieve simultaneous localization and mapping of the robot.

An overview of the proposed method is shown in Fig. 1. The binocular cameras are calibrated offline by the method proposed in [22], which yields the internal and external parameters of the cameras. The rectified left and right images are used to detect feature points with the GPU-accelerated SIFT operator, which typically produces several hundred feature points. The SIFT detector is known for its reliability in detecting the same features again (i.e. no flickering of feature points in consecutive images), which is important for this application. The 3D feature point cloud is then generated from the disparity of the feature points according to the triangulation principle, and the feature map of the robot motion is built from it. Since the six-legged robot does not move fast, the cameras are triggered synchronously at intervals of 2 s to capture scene images, which reduces the amount of computation while maintaining the accuracy of the motion estimation. For the current frame and a previous frame (depending on the experiment), the ICP algorithm is applied to calculate the rotation and translation matrices of the feature map.

The traditional SIFT algorithm includes four main stages for extracting feature points: scale space construction, keypoint detection and localization, keypoint orientation assignment, and keypoint descriptor generation. We introduce and analyze them as follows.

  • (A)

    Scale space construction

Scale space is defined as a function L(x, y, σ), produced by convolving a variable-scale Gaussian G(x, y, σ) with an input image I(x, y):

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$

where * denotes convolution and G(x, y, σ) is the Gaussian filter function

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} \, e^{-(x^{2}+y^{2})/(2\sigma^{2})}$$

Here (x, y) are image coordinates and σ is the scale factor, whose size determines the degree of smoothing of the image. The difference-of-Gaussian (DoG) scale space is defined as

$$D(x, y, \sigma) = \big(G(x, y, k\sigma) - G(x, y, \sigma)\big) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)$$

where σ and kσ are adjacent scales and k is a constant multiplicative factor. The image scale space is expressed as a discrete image pyramid consisting of several octaves, each of which comprises several levels. The DoG pyramid is created by subtracting adjacent images in the Gaussian scale space.
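
For concreteness, one common choice of the scales within an octave is the parameterization used in Lowe's original formulation [11] (the exact scheduling is an assumption here, as it is not spelled out above):

$$k = 2^{1/s}, \qquad \sigma_{i} = \sigma_{0}\, k^{i} = \sigma_{0}\, 2^{i/s}, \qquad i = 0, 1, \dots, s+2,$$

where s is the number of usable intervals per octave: after s steps the scale has doubled, the Gaussian image is downsampled by a factor of two, and the next octave repeats the same schedule.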

  • (B)

    Keypoint detection and localization

Once the DoG images have been constructed, keypoints are identified as local minima/maxima of the DoG images. Each pixel in the DoG images is compared with its 8 neighbors at the same level and with the 3 × 3 corresponding neighboring pixels at each of the two adjacent levels. If the pixel value is the maximum or minimum among all compared pixels, it is selected as a candidate keypoint.

The interpolation of the keypoint location is done using the quadratic Taylor expansion of the DoG scale-space function:

$$D(X) = D + \frac{\partial D^{T}}{\partial X} X + \frac{1}{2} X^{T} \frac{\partial^{2} D}{\partial X^{2}} X$$

Then, the location of the extreme point $\hat{X}$ is determined by taking the derivative of this function with respect to X and setting it to zero:

$$\hat{X} = -\left(\frac{\partial^{2} D}{\partial X^{2}}\right)^{-1} \frac{\partial D}{\partial X}, \qquad D(\hat{X}) = D + \frac{1}{2} \frac{\partial D^{T}}{\partial X} \hat{X}$$

The DoG function has a strong edge response. Therefore, in order to increase stability, the Hessian matrix is used to eliminate candidate keypoints that have poorly determined locations but high edge responses:

$$H = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix}, \qquad \mathrm{Tr}(H) = D_{xx} + D_{yy} = \alpha + \beta, \qquad \mathrm{Det}(H) = D_{xx} D_{yy} - (D_{xy})^{2} = \alpha\beta$$

where α represents the larger eigenvalue and β the smaller one. Supposing that $\alpha = r\beta$, the ratio R can be defined as

$$R = \frac{\mathrm{Tr}(H)^{2}}{\mathrm{Det}(H)} = \frac{(\alpha + \beta)^{2}}{\alpha\beta} = \frac{(r\beta + \beta)^{2}}{r\beta^{2}} = \frac{(r+1)^{2}}{r}$$

It follows that, for some threshold eigenvalue ratio $r_{th}$, if R for a candidate keypoint is larger than $(r_{th}+1)^{2}/r_{th}$, that keypoint is poorly localized and hence rejected.
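
As a concrete illustration of this test, the following self-contained CUDA snippet (our sketch, not the authors' code) applies the trace/determinant check; the threshold r_th = 10 follows Lowe [11].

// Sketch of the edge-response rejection test; not the authors' implementation.
#include <cstdio>

__host__ __device__ bool isWellLocalized(float dxx, float dyy, float dxy, float rth = 10.0f)
{
    float tr  = dxx + dyy;               // alpha + beta
    float det = dxx * dyy - dxy * dxy;   // alpha * beta
    if (det <= 0.0f) return false;       // curvatures of opposite sign: reject
    return tr * tr / det < (rth + 1.0f) * (rth + 1.0f) / rth;
}

int main()
{
    // A strongly elongated (edge-like) response is rejected, a blob-like one is kept.
    printf("%d %d\n", isWellLocalized(10.0f, 0.2f, 0.0f), isWellLocalized(5.0f, 4.0f, 0.1f));
    return 0;
}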

  • (C)

    Keypoint orientation assignment

Each keypoint is assigned one or more orientations based on the local image gradient directions. The keypoint descriptor can then be represented relative to this orientation and thereby achieves invariance to image rotation. For the image L(x, y) at the keypoint's scale, the gradient magnitude m(x, y) and orientation θ(x, y) are precomputed using pixel differences:

$$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^{2} + (L(x, y+1) - L(x, y-1))^{2}}$$

$$\theta(x, y) = \tan^{-1}\!\big((L(x, y+1) - L(x, y-1)) \,/\, (L(x+1, y) - L(x-1, y))\big)$$
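
The per-pixel computation maps naturally onto one GPU thread per pixel. Below is a minimal CUDA sketch of such a kernel (our illustration, not the paper's code; atan2f is used as a numerically robust form of the arctangent above).

// Per-pixel gradient magnitude and orientation from central differences.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void gradientKernel(const float* L, float* mag, float* ori, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;   // skip the image border
    float dx = L[y * w + (x + 1)] - L[y * w + (x - 1)];
    float dy = L[(y + 1) * w + x] - L[(y - 1) * w + x];
    mag[y * w + x] = sqrtf(dx * dx + dy * dy);
    ori[y * w + x] = atan2f(dy, dx);                           // orientation in radians
}

int main()
{
    const int w = 16, h = 16;
    float hImg[w * h];
    for (int i = 0; i < w * h; ++i) hImg[i] = float(i % w);    // simple horizontal ramp
    float *dImg, *dMag, *dOri;
    cudaMalloc(&dImg, sizeof(hImg)); cudaMalloc(&dMag, sizeof(hImg)); cudaMalloc(&dOri, sizeof(hImg));
    cudaMemcpy(dImg, hImg, sizeof(hImg), cudaMemcpyHostToDevice);
    gradientKernel<<<dim3(1, 1), dim3(16, 16)>>>(dImg, dMag, dOri, w, h);
    float m;
    cudaMemcpy(&m, dMag + (8 * w + 8), sizeof(float), cudaMemcpyDeviceToHost);
    printf("magnitude at (8,8): %f\n", m);   // expect 2.0: the central difference spans two pixels of a unit ramp
    cudaFree(dImg); cudaFree(dMag); cudaFree(dOri);
    return 0;
}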

  • (D)

    Keypoint descriptor

A set of orientation histograms is created on 4 × 4 pixel neighborhoods, with 8 bins each. These histograms are computed from the magnitude and orientation values of samples in a 16 × 16 region around the keypoint, such that each histogram contains samples from a 4 × 4 sub-region of the original neighborhood region. The magnitudes are further weighted by a Gaussian function with σ equal to one half of the width of the descriptor window. Since there are 4 × 4 = 16 histograms, each with 8 bins, the vector has 128 elements. This vector is then normalized to unit length in order to enhance invariance to affine changes in illumination.

In this section, the details of the GPU-based SIFT acceleration method are presented. We take full advantage of the GPU's capabilities in parallel computation, floating-point computation, and memory management, and allocate resources reasonably between the host (CPU) and the device (GPU) in the implementation of SIFT. An overview of the steps of the GPU-based SIFT implementation is shown in Fig. 2. Scale space construction, keypoint detection and localization, keypoint orientation assignment, and keypoint descriptor generation are all implemented on the device with CUDA.

First, the image data is loaded from the host memory to the device memory and bound to texture memory. We chose texture memory because its on-chip cache reduces memory requests, provides more efficient memory bandwidth, and maintains high performance for random reads. Moreover, when a Gaussian filter is applied to a 2D image array, boundary checks on the image are necessary; since the GPU is inefficient at executing conditional statements, texture memory configured with the appropriate addressing mode handles such boundary cases automatically and efficiently.
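
As a rough illustration of this setup, the sketch below uses the CUDA texture-object API (a newer interface than the texture references common when this work was done) with the clamp addressing mode, so out-of-bounds reads during filtering need no explicit conditionals; all names and sizes are illustrative.

// Upload an image to a CUDA array and bind it as a texture with clamp addressing.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void readKernel(cudaTextureObject_t tex, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // Reading at (x-1, y) is safe even for x == 0: the clamp mode returns the border pixel.
    out[y * w + x] = tex2D<float>(tex, x - 1 + 0.5f, y + 0.5f);
}

int main()
{
    const int w = 8, h = 8;
    float img[w * h];
    for (int i = 0; i < w * h; ++i) img[i] = float(i);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, img, w * sizeof(float), w * sizeof(float), h,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaTextureDesc tdesc = {};
    tdesc.addressMode[0] = cudaAddressModeClamp;   // automatic border handling
    tdesc.addressMode[1] = cudaAddressModeClamp;
    tdesc.filterMode     = cudaFilterModePoint;
    tdesc.readMode       = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &tdesc, nullptr);

    float* dOut; cudaMalloc(&dOut, sizeof(img));
    readKernel<<<dim3(1, 1), dim3(w, h)>>>(tex, dOut, w, h);
    float out[w * h];
    cudaMemcpy(out, dOut, sizeof(img), cudaMemcpyDeviceToHost);
    printf("out[0] = %f (clamped read of pixel (-1,0) returns pixel (0,0))\n", out[0]);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dOut);
    return 0;
}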

The Gaussian kernels of the different scales were uploaded to the constant memory of the device. Exploiting the separability of the Gaussian kernel function, the two-dimensional Gaussian convolution was decomposed into two one-dimensional Gaussian convolutions to filter the image [23]. Each image was divided into a series of image blocks of width W and height 1 (for the horizontal pass) or height H and width 1 (for the vertical pass), as shown in Fig. 3. Each image block was processed by one thread block, and each thread in a thread block processed one row or one column of the image block. Applying the horizontal and then the vertical one-dimensional Gaussian filter is equivalent to applying the two-dimensional Gaussian filter to the entire image. The octaves and levels of the Gaussian pyramid were assigned to different blocks; the difference values of each pixel were processed in parallel and the per-block results were then combined, which accelerates the construction of the Gaussian scale-space pyramid.
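
A minimal sketch of the separable-filter idea is given below (our illustration; it shows only a horizontal pass with one thread per row and toy filter taps, not the exact block decomposition of Fig. 3). The vertical pass is symmetric.

// Separable Gaussian filtering: row pass with kernel taps in constant memory.
#include <cuda_runtime.h>
#include <cstdio>

#define RADIUS 2
__constant__ float c_kernel[2 * RADIUS + 1];

__global__ void gaussRow(const float* in, float* out, int w, int h)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per image row
    if (y >= h) return;
    for (int x = 0; x < w; ++x) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k) {
            int xi = min(max(x + k, 0), w - 1);      // clamp at the image border
            acc += c_kernel[k + RADIUS] * in[y * w + xi];
        }
        out[y * w + x] = acc;
    }
}

int main()
{
    const int w = 8, h = 4;
    float img[w * h], taps[2 * RADIUS + 1] = {0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f};
    for (int i = 0; i < w * h; ++i) img[i] = (i % w < w / 2) ? 0.0f : 1.0f;   // step edge
    cudaMemcpyToSymbol(c_kernel, taps, sizeof(taps));
    float *dIn, *dOut;
    cudaMalloc(&dIn, sizeof(img)); cudaMalloc(&dOut, sizeof(img));
    cudaMemcpy(dIn, img, sizeof(img), cudaMemcpyHostToDevice);
    gaussRow<<<1, h>>>(dIn, dOut, w, h);
    float out[w * h];
    cudaMemcpy(out, dOut, sizeof(img), cudaMemcpyDeviceToHost);
    for (int x = 0; x < w; ++x) printf("%.3f ", out[x]);   // the step edge is smoothed
    printf("\n");
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}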

Assume the Gaussian pyramid has O octaves and each octave has S layers. After subtracting adjacent Gaussian images, we obtain S − 1 difference-of-Gaussian (DoG) images. To detect the local maxima and minima of the DoG, each point is compared with its 26 neighbors. We first test for an extremum within the 8-pixel neighborhood at the same level; experiments showed that 95% of the candidate pixels can be eliminated this way. Then, for the remaining pixels, the full 26-neighborhood extremum test is performed. The kernel function is called cyclically S − 3 times to detect the local keypoints. In order to reduce boundary checking and improve the efficiency of the detection algorithm, each block processes a 16 × 16 image tile; if the width or height of the image is not exactly divisible by 16, the image edge is padded with pixels whose grey value is 0, as shown in Fig. 4. To a large extent, this improves calculation efficiency and saves time. Because the device handles branches and logical judgments inefficiently, the preliminarily detected extreme points are read back to the host for precise localization; the refined results are then transferred back to the device and saved in global memory.
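
The two-stage test can be sketched as the following CUDA kernel (our illustration, not the authors' kernel); it performs the cheap in-level 8-neighbor test first and the full 26-neighbor test only for the survivors. The paper uses 16 × 16 tiles; a small toy image is enough here.

// Extremum detection over three consecutive DoG levels.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dogExtrema(const float* below, const float* mid, const float* above,
                           unsigned char* isExtremum, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;
    float v = mid[y * w + x];
    bool isMax = true, isMin = true;
    // Cheap pre-test: the 8 neighbours in the same level eliminate most pixels.
    for (int dy = -1; dy <= 1 && (isMax || isMin); ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            float n = mid[(y + dy) * w + (x + dx)];
            isMax &= (v > n);
            isMin &= (v < n);
        }
    // Full test against the adjacent DoG levels for the survivors.
    for (int dy = -1; dy <= 1 && (isMax || isMin); ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            float a = above[(y + dy) * w + (x + dx)];
            float b = below[(y + dy) * w + (x + dx)];
            isMax &= (v > a) && (v > b);
            isMin &= (v < a) && (v < b);
        }
    isExtremum[y * w + x] = (isMax || isMin) ? 1 : 0;
}

int main()
{
    const int w = 8, h = 8, n = w * h;
    float below[n] = {}, mid[n] = {}, above[n] = {};
    mid[4 * w + 4] = 1.0f;                            // a single bright bump is a maximum
    float *dB, *dM, *dA; unsigned char* dE;
    cudaMalloc(&dB, sizeof(below)); cudaMalloc(&dM, sizeof(mid));
    cudaMalloc(&dA, sizeof(above)); cudaMalloc(&dE, n);
    cudaMemset(dE, 0, n);
    cudaMemcpy(dB, below, sizeof(below), cudaMemcpyHostToDevice);
    cudaMemcpy(dM, mid, sizeof(mid), cudaMemcpyHostToDevice);
    cudaMemcpy(dA, above, sizeof(above), cudaMemcpyHostToDevice);
    dogExtrema<<<dim3(1, 1), dim3(8, 8)>>>(dB, dM, dA, dE, w, h);
    unsigned char e[n];
    cudaMemcpy(e, dE, n, cudaMemcpyDeviceToHost);
    printf("extremum at (4,4): %d\n", e[4 * w + 4]);  // expect 1
    cudaFree(dB); cudaFree(dM); cudaFree(dA); cudaFree(dE);
    return 0;
}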

The refined extreme points computed on the host are then assigned to blocks on the device. Each keypoint is processed by one block, which calculates the gradient orientation and magnitude of the surrounding pixels. Each block is divided into 16 × 16 threads, and each thread processes one pixel.

We divide the square region around the feature point, whose edge is 12σ, into 4 × 4 square sub-regions whose edges are 3σ. An 8-orientation histogram is generated for each sub-region by accumulating the contribution of the gradient orientation of each pixel to the orientation histogram. We thus obtain 4 × 4 × 8 values, which compose the 128-dimensional vector. Considering that the histograms are data-independent of each other, we use one thread to process one sub-region. In this allocation strategy there are 64 threads in a block, processing four feature points. A thread computes the weights of all the pixels in its sub-region and transfers the eight bins of its histogram to global memory; the processing results of 16 threads then form one 128-dimensional feature vector.
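
The thread-to-work mapping described above can be sketched as follows (our illustration; Gaussian weighting, trilinear interpolation, and rotation to the keypoint orientation are omitted for brevity): 64 threads per block, 4 keypoints per block, and one 4 × 4-pixel sub-region per thread.

// Descriptor accumulation with one thread per sub-region.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void descriptorKernel(const float* mag, const float* ori, int w,
                                 const int2* kp, int nKp, float* desc /* nKp x 128 */)
{
    int local = threadIdx.x;              // 0..63
    int k = blockIdx.x * 4 + local / 16;  // which keypoint this thread contributes to
    int sub = local % 16;                 // which of the 16 sub-regions
    if (k >= nKp) return;
    int sx = (sub % 4) * 4 - 8, sy = (sub / 4) * 4 - 8;   // sub-region offset from the keypoint
    float hist[8] = {0};
    for (int dy = 0; dy < 4; ++dy)
        for (int dx = 0; dx < 4; ++dx) {
            int x = kp[k].x + sx + dx, y = kp[k].y + sy + dy;
            float m  = mag[y * w + x];
            float th = ori[y * w + x];                     // in [-pi, pi]
            int bin = (int)floorf((th + 3.14159265f) / (2.0f * 3.14159265f) * 8.0f) & 7;
            hist[bin] += m;
        }
    for (int b = 0; b < 8; ++b)
        desc[k * 128 + sub * 8 + b] = hist[b];             // 16 sub-regions x 8 bins = 128
}

int main()
{
    const int w = 32, h = 32, n = w * h;
    float mag[n], ori[n];
    for (int i = 0; i < n; ++i) { mag[i] = 1.0f; ori[i] = 0.5f; }   // uniform toy gradients
    int2 kp = make_int2(16, 16);
    float *dMag, *dOri, *dDesc; int2* dKp;
    cudaMalloc(&dMag, sizeof(mag)); cudaMalloc(&dOri, sizeof(ori));
    cudaMalloc(&dKp, sizeof(kp));   cudaMalloc(&dDesc, 128 * sizeof(float));
    cudaMemcpy(dMag, mag, sizeof(mag), cudaMemcpyHostToDevice);
    cudaMemcpy(dOri, ori, sizeof(ori), cudaMemcpyHostToDevice);
    cudaMemcpy(dKp, &kp, sizeof(kp), cudaMemcpyHostToDevice);
    descriptorKernel<<<1, 64>>>(dMag, dOri, w, dKp, 1, dDesc);
    float desc[128];
    cudaMemcpy(desc, dDesc, sizeof(desc), cudaMemcpyDeviceToHost);
    printf("desc[0..7]: ");
    for (int b = 0; b < 8; ++b) printf("%.1f ", desc[b]);   // with uniform gradients, all mass falls in one bin
    printf("\n");
    return 0;
}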

The KD-tree search is the most time-consuming part of SIFT feature matching. The KD-tree is built on the CPU and bound to texture memory. Euclidean distance is used as the measure to find the two nearest feature points on the GPU, and the data is stored in the shared memory of each block, which effectively improves the speed of data access. Each block contains 128 threads, and each thread is responsible for the search of one feature point. The thread assignment for feature matching is shown in Fig. 5. When the number of feature points is less than 128, zero-padding is adopted to meet the minimum computing needs of a block and reduce the overhead of boundary judgments. When the number of feature points is much greater than 128, multiple blocks are used for parallel computation.
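
For illustration, the sketch below keeps only the thread layout described above (128 threads per block, one query feature per thread, a distance-ratio acceptance test), but it scans the candidate descriptors with a brute-force loop instead of the paper's KD-tree traversal; names and sizes are ours.

// Nearest/second-nearest descriptor search with a ratio test.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void matchKernel(const float* descA, int nA, const float* descB, int nB,
                            int* matchIdx, float ratio)
{
    int i = blockIdx.x * 128 + threadIdx.x;
    if (i >= nA) return;                                   // zero-padded queries do nothing
    float best = 1e30f, second = 1e30f; int bestJ = -1;
    for (int j = 0; j < nB; ++j) {
        float d = 0.0f;
        for (int k = 0; k < 128; ++k) {
            float diff = descA[i * 128 + k] - descB[j * 128 + k];
            d += diff * diff;
        }
        if (d < best)        { second = best; best = d; bestJ = j; }
        else if (d < second) { second = d; }
    }
    // Accept only if clearly better than the second-best candidate.
    matchIdx[i] = (best < ratio * ratio * second) ? bestJ : -1;
}

int main()
{
    const int nA = 2, nB = 3;
    float a[nA * 128] = {}, b[nB * 128] = {};
    for (int k = 0; k < 128; ++k) { a[0 * 128 + k] = 1.0f; b[2 * 128 + k] = 1.0f; }  // A0 matches B2
    float *dA, *dB; int* dM;
    cudaMalloc(&dA, sizeof(a)); cudaMalloc(&dB, sizeof(b)); cudaMalloc(&dM, nA * sizeof(int));
    cudaMemcpy(dA, a, sizeof(a), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, sizeof(b), cudaMemcpyHostToDevice);
    matchKernel<<<(nA + 127) / 128, 128>>>(dA, nA, dB, nB, dM, 0.8f);
    int m[nA];
    cudaMemcpy(m, dM, sizeof(m), cudaMemcpyDeviceToHost);
    printf("match of A0 -> B%d, match of A1 -> %d (ambiguous, rejected)\n", m[0], m[1]);
    return 0;
}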

In the traditional SIFT implementation, a given feature point in one image must be compared with all feature points in the other image to find its correspondence. Since each feature point carries 128-dimensional data, the amount of calculation is very large, and the runtime of this matching method is not suitable for real-time VSLAM.

Although there is aberration in the captured images, when the epipolar line constraint [24] is added the matching feature points usually lie on the same epipolar line or close to it (the deviation of the corresponding feature point P does not exceed ±δ pixels, with δ = 3 based on experiment). The search scope of feature point matching is then restricted to a band along the epipolar line, which effectively reduces the amount of calculation. In order to remove redundant match points, sequential-consistency and uniqueness constraints were added. The sequential-consistency constraint requires that corresponding feature points on the same epipolar line of the two images appear in the same order. The uniqueness constraint requires that a feature in one stereo half-image be matched to at most one similar feature in the other half-image.
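
A host-side sketch of these filters, assuming rectified images so that epipolar lines are horizontal, is given below (our illustration; the ordering check is omitted for brevity, and the point type and function names are hypothetical).

// Epipolar-band and uniqueness filtering of candidate stereo matches.
#include <cstdio>
#include <cmath>
#include <vector>
#include <utility>

struct Pt { float x, y; };

std::vector<std::pair<int,int>> filterMatches(const std::vector<Pt>& left,
                                              const std::vector<Pt>& right,
                                              const std::vector<int>& candidate, // candidate[i]: right index or -1
                                              float delta = 3.0f)                // +/- delta pixels, as in the text
{
    std::vector<std::pair<int,int>> kept;
    std::vector<bool> used(right.size(), false);
    for (size_t i = 0; i < left.size(); ++i) {
        int j = candidate[i];
        if (j < 0 || used[j]) continue;                           // uniqueness constraint
        if (std::fabs(left[i].y - right[j].y) > delta) continue;  // epipolar-band constraint
        used[j] = true;
        kept.push_back({(int)i, j});
    }
    return kept;
}

int main()
{
    std::vector<Pt> L = {{10, 50}, {40, 80}}, R = {{12, 51}, {42, 95}};
    std::vector<int> cand = {0, 1};
    auto kept = filterMatches(L, R, cand);
    printf("kept %zu of %zu candidate matches\n", kept.size(), cand.size());   // expect 1 of 2
    return 0;
}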

To evaluate the performance of the GPU-based SIFT method, tests were performed on ATI (Radeon HD 5450) and NVIDIA (GeForce GTX 430) graphics cards. These tests showed an improvement of one order of magnitude in speed over a standard SIFT implementation; a 10× speedup over the CPU version is observed on the GPU. The feature point matching result for a real scene image is shown in Fig. 6. Fig. 7 compares the runtime of the CPU and GPU implementations over a range of image resolutions. GPU-SIFT running on the NVIDIA GeForce GTX 430 could extract about 1000 SIFT features from streaming 900 × 750 resolution video at an average frame rate of 30 Hz. Fig. 8 shows the running time of the different steps of the GPU-based SIFT algorithm for different image resolutions. As shown in Fig. 9, the evaluation results indicate that the NVIDIA graphics cards currently outperform the tested ATI graphics cards, because the numbers of stream processors and texture units of the NVIDIA cards are superior to those of the ATI cards. The proposed algorithm still provides a good acceleration effect on different graphics cards, and as graphics performance increases, the acceleration of the proposed algorithm becomes even more apparent. The runtime of the GPU-based SIFT method can meet the requirements of the six-legged robot's real-time VSLAM.

We also compared the performance of the proposed GPU-based SIFT method against SIFT [11], PCA-SIFT [25], ORB [26], Wu [27], and Acharya et al. [28] under the same conditions. As shown in Fig. 10, the proposed GPU-based method outperformed the other methods; it achieves an acceleration of up to 10× and can extract 1000 features at an average frame rate of 30 Hz.


Pose estimation implementation

We trigger the cameras to capture images ahead of the robot at a certain time interval, which is determined by the speed of the robot. The initial correspondence step provides a set of corresponding points whose 3-D positions are known with respect to their camera coordinate systems. The motion between the two camera coordinate systems is described by a rotation R and a translation t. The relation between the points of the previous and current frame (Ci and Pi, with corresponding points having

Testing platform

The VSLAM methods described in this paper are implemented on the experimental, insect-like hexapod robot shown in Fig. 11. Each of the six legs has three joints with three active degrees of freedom and a three-axis force sensor on each foot to sense the interaction force with the terrain. All joints are driven by permanent magnet synchronous motors in combination with harmonic drive gears. Within each joint there are a motor angle sensor, a link-side joint angle sensor, as well as a joint torque

Conclusions

The pose estimation of the robot plays an important role in deliberative systems, where the robot measures the state of the environment and plans its motion. The GPU-based real-time SLAM using stereo vision and an AHRS as input sensors enables the HIT-II six-legged robot to walk robustly on rough terrain and to build maps. The GPU-parallel method using CUDA achieves an acceleration of up to 10× compared with the traditional SIFT method, and it can extract about 1000 SIFT features from images with 900 × 750

Acknowledgment

This work is partially supported by the National Natural Science Foundation of China (NSFC) project "Environment modeling and autonomous motion planning of six-legged robot" (No. 61473104), by the Self-Planned Tasks (No. SKLRS201410B and No. SKLRS201501A02) of the State Key Laboratory of Robotics and System (HIT), and by the National Magnetic Confinement Fusion Science Program "Multi-Purpose Remote Handling System with Large-Scale Heavy Load Arm" (2012GB102004). The authors also gratefully acknowledge the


References (29)

  • J. Sturm et al., Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark.

  • R.A. Newcombe et al., KinectFusion: real-time dense surface mapping and tracking.

  • D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004).

  • W. Zhang et al., Solving energy-aware real-time tasks scheduling problem with shuffled frog leaping algorithm on heterogeneous platforms, Sensors (2015).

Zhang Xuehe received the B.S. and M.S. degrees from Liaoning University of Science and Technology in 2008 and 2011, respectively, both in computer science and technology. Since September 2011 he has been a Ph.D. student at Harbin Institute of Technology. His research interests include the joint area of computer graphics and vision, including 3D shape reconstruction from multiple views, computational photography, and stereo vision navigation and path planning of the six-legged robot.

Li Ge received the Ph.D. degree from Harbin Institute of Technology in 2008, in Mechanical and Electronic Engineering. Since September 2008 he has been with the School of Mechanical and Electronic Engineering, Harbin Institute of Technology, where he is an assistant professor. His research interests include multi-robot motion planning and control, robot vision, and GPU-based parallel image processing.

Liu Gangfeng received the Ph.D. degree from Harbin Institute of Technology in 2010, in Mechanical and Electronic Engineering. Since September 2010 he has been with the School of Mechanical and Electronic Engineering, Harbin Institute of Technology, where he is a lecturer. His research interests include robotic tele-operation technology and space manipulator technology.

Zhao Jie received the B.S., M.S., and Ph.D. degrees from Harbin Institute of Technology in 1990, 1993, and 1996, respectively, all in Mechanical and Electronic Engineering. He is a distinguished professor of the Cheung Kong Scholars Programme of the Ministry of Education, and leader of the advanced manufacturing intelligent robot theme of the national 'Twelfth Five-Year' and '863' plans. His research interests include multi-sensor system integration and control technology and Internet-based robotic tele-operation technology.

Hou Zhenxiu received the B.S., M.S., and Ph.D. degrees from Harbin Institute of Technology in 1982, 1991, and 2006, respectively, all in Mechanical and Electronic Engineering. Since September 1982 she has been with the School of Mechanical and Electronic Engineering, Harbin Institute of Technology, where she is a professor and doctoral supervisor. Her research interests include aerospace mechatronics, aerospace materials engineering, and reliability optimization of aerospace sealing mechanisms and components.
