Visual navigation method for indoor mobile robot based on extended BoW model

This article proposes a new navigation method for mobile robots based on an extended bag-of-words (BoW) model for general object recognition in indoor environments. The scale-invariant feature transform (SIFT) detection algorithm, accelerated with graphics processing unit (GPU) technology, is used to describe feature vectors in this model. First, in order to add some redundant image information, statistical information about the spatial relationships of all the feature points in an image, i.e. their relative distances and angles, is used to extend the feature vectors of the original BoW model. Then, the support vector machine (SVM) classifier is used to classify objects. Also, in order to navigate conveniently in unknown and dynamic indoor environments, a type of human-robot interaction based on a hand-drawn semantic map is considered. The experimental results show that this new navigation technology for indoor mobile robots is robust and highly effective.


Introduction
In the research field of mobile robots, navigation is very important and necessary; its aim is to make a robot move to the desired destination while completing specified tasks as expected.
The first step in navigation is to build a model of the environment, which can be a grid, geometric, topological, or 3D one. Thrun proposed a hybrid method that combines grid and topological models [1,2], using the grid to represent local features and a Voronoi graph [3] to represent topological structures. Wu Xuejian et al. proposed a visual navigation method for mobile robots based on a hand-drawn map [4], which provides the start and end points, the approximate distance between them, the approximate positions of the landmarks the robot might meet during navigation, etc. Relative to traditional map-building models, such as grid, geometric, and topological ones, we use the more convenient hand-drawn map [5]. Although some researchers have used artificial [6][7][8][9][10] and quasi-artificial landmarks (photos of natural landmarks taken in advance [4,5]), recognising natural landmarks is essential for navigating successfully in real environments. However, unlike an artificial landmark, a natural one is not labelled.
Also, we should not restrict natural landmarks to specific objects, because the environment is changeable; in other words, even if landmarks change to some extent or take another form, a robot should still be capable of recognising them using a visual sensor. To achieve this goal, we must solve the problem of general object recognition in a real environment.
The bag-of-words (BoW) model is an effective algorithmic means of general object recognition because of its simple strategy and its robustness to an object's position and deformation in an image. However, as each feature in the model is independent of the others, no spatial relationships are considered, whereas such relationships among features could be useful for describing the internal structures of objects or highlighting the importance of their contextual visual information. Although research on this theme is becoming more and more popular [11][12][13][14][15][16][17][18][19], there is still room for improvement.
Here, we propose a novel extended BoW method which works in the following statistical way. A multi-dimensional vector is used to describe an image, with its elements divided into two parts: one describes the image's local features, and the other the spatial relationships among them. Then, we use the support vector machine (SVM) classifier to train our model to obtain discriminant functions and, taking real-time requirements into account, graphics processing unit (GPU) acceleration to speed up image processing. Finally, this method is successfully applied to indoor mobile robot navigation.

Building model of environment
Building a model of the environment aimed at planning a path for a mobile robot involves feature extraction and information representation.
It is not necessary for many biological systems, such as those of human beings, butterflies, and bees, to obtain precise distance information when perceiving their environments through their visual systems as they navigate by remembering some key landmarks based on qualitative analysis.
Based on the method proposed in [5], we design a hand-drawn map for guiding the robot's navigation. Its advantage is that we do not need to input detailed environmental information into the robot, and it enables our navigation model to handle dynamic situations, such as landmarks that change often or a person walking around the robot without stopping. We require only the starting point and orientation of the robot, the route and its approximate physical distance, and rough estimates of the locations of the landmarks the robot might encounter during navigation.
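For concreteness, the information carried by such a hand-drawn map can be held in a very small data structure; the field names below are illustrative choices of ours, not those of [5]:

```python
from dataclasses import dataclass, field

@dataclass
class Landmark:
    category: str        # semantic label, e.g. "chair"; any instance of the category counts
    approx_xy: tuple     # rough sketched position, in metres

@dataclass
class HandDrawnMap:
    start_xy: tuple           # starting point of the robot
    start_heading_deg: float  # initial orientation
    route: list               # ordered waypoints (x, y) giving the approximate route
    landmarks: list = field(default_factory=list)

# Example: a short route passing a chair and a wastebasket
sketch = HandDrawnMap(start_xy=(0.0, 0.0), start_heading_deg=90.0,
                      route=[(0.0, 0.0), (0.0, 4.0), (3.0, 4.0)],
                      landmarks=[Landmark("chair", (0.5, 2.0)),
                                 Landmark("wastebasket", (2.0, 4.5))])
```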
Our novel extended BoW model for general object recognition, proposed to overcome the challenge of recognising natural landmarks, is discussed in the next section. An object can be characterised by features such as its local parts and their spatial relationships with one another. While a human being can understand the advanced semantic features of a picture, a computer can comprehend only the raw information in an image. However, it is still very helpful to refer to the human vision system, because the course of general object recognition is analogous to a human's judgment: first, descriptors of general objects are established; then their categories are determined through machine learning; and, finally, the learned model is used to classify and recognise new objects [20].

Overview of recognition algorithm
The method proposed here, the framework of which is presented in Fig. 1, follows the principle of general object recognition, that is, first, it describes an image, then learns the object model and, finally, classifies objects.

Building vision code library
In 1999, D. G. Lowe [21] proposed the scale-invariant feature transform (SIFT) algorithm based on scale space, which is invariant to translation, rotation, and zoom; he improved it in 2004 [22]. It is widely used for object recognition and has a very strong image-matching capability, with the following general characteristics.
(a) A SIFT feature is a local feature of an image which remains invariant to translation, rotation, zoom, illumination, occlusion, and noise, and even maintains some degree of stability under changes in viewpoint and affine transformation. (b) It carries abundant information, which is very helpful for fast and accurate image matching. (c) It is fast, and may satisfy the real-time requirements of image matching after optimisation. (d) It has strong extendibility and may be combined with other feature vectors.
Therefore, we choose the SIFT algorithm to detect key points. For example, to construct a code library of cars, we first choose pictures of different cars from different views and then detect key points using this algorithm.
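A minimal sketch of this detection step with OpenCV's SIFT implementation (the file name is illustrative):

```python
import cv2

def detect_sift(image_path):
    """Detect SIFT key points and their 128-D descriptors in one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    return keypoints, descriptors   # descriptors: (num_keypoints, 128) array

# e.g. gather descriptors from pictures of different cars, from different views
kps, descs = detect_sift("car_side_view.jpg")
```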
After completing feature detection, we need to establish a vision code library from the large number of 'words' obtained by detecting many images. However, as some SIFT descriptors are similar to one another, it is necessary to cluster these vision words. K-means is a common clustering method [23] which groups elements around K centres according to the distance between each element and a centre, obtaining the clustering centres through successive iterations. For example, if the elements for clustering are $(x_1, x_2, x_3, \ldots, x_{n-1}, x_n)$ and every element is a $d$-dimensional vector, they are clustered into K sets $s = \{s_1, s_2, s_3, \ldots, s_{K-1}, s_K\}$ by minimising the sum of within-cluster variances:

$$\arg\min_{s} \sum_{i=1}^{K} \sum_{x_j \in s_i} \left\| x_j - m_i \right\|^2$$

where $m_i$ is the mean of $s_i$ [24]. Based on experience, we adopt K = 600 and, after K-means clustering, complete the construction of a vision code library in which every code is a 128-dimensional SIFT description vector.
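A sketch of the codebook construction with scikit-learn's K-means, stacking the descriptors of all training images and using K = 600 as in the text:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_code_library(descriptor_sets, k=600):
    """Cluster SIFT descriptors from many images into k vision codes."""
    all_descs = np.vstack(descriptor_sets)   # (total_features, 128)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descs)
    return kmeans.cluster_centers_           # (k, 128): one 128-D code per cluster
```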

Pre-processing images
As a computer can understand only raw information, acquiring the high-level meaning that reflects an object's appearance is the major difficulty of general object recognition.
As previously mentioned, we obtain a vision code library of certain objects and, before describing an image, compute the similarity between each local feature and every word in the library. If the similarity meets the threshold, this local feature is considered a key point belonging to the object; for example, if there are N vision codes in the library and M local features in an image, the matching proceeds as in the pseudo-code below.
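A minimal Python sketch of this matching loop; cosine similarity and the 0.8 threshold are assumptions on our part, as any similarity measure compared against a threshold fits the scheme:

```python
import numpy as np

def match_features(P, Q, threshold=0.8):
    """Keep the M local features in P (M x 128) that are similar
    enough to at least one of the N vision codes in Q (N x 128)."""
    # Normalise rows so that a dot product equals cosine similarity
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    sims = Pn @ Qn.T                      # (M, N) similarity matrix
    keep = sims.max(axis=1) >= threshold  # feature i kept if its best code matches
    return P[keep], np.where(keep)[0]     # surviving descriptors and their indices
```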
Here $P_i$ denotes the SIFT descriptor of the ith local feature in the image and $Q_j$ that of the jth vision code in the library, and a similarity function defined over $(P_i, Q_j)$ is compared against the threshold. After performing local feature extraction, we normalise every SIFT descriptor. Although the remaining local features belong to the image, some are produced by the background. Therefore, an extra operation is required to delete them if: (i) the number of local features obtained from the object is much greater than that from the background after computing similarities; or (ii) we want to further reduce the background disturbance based on the density distribution of the local features. Fig. 2 shows the results when we obtain T local features from the original M after computing similarities. Obviously, if we want to obtain the object in the rectangle despite some disturbance, we may use RANSAC to reduce the negative influence on the later image description. For convenience, we use a circle to cover the area where the density of features is very high, using the following pseudo-code.
While iteration < Times:
(a) Randomly select key points from the T data as the points inside the model, and take their mean as the possible centre of a circle.
(b) Define the radius R as the maximum of the distances between the selected key points and the possible centre.
(c) For every key point that does not belong to the model, if its distance to the possible centre is <1.2R, consider that it does belong and add 1 to the number of key points in the model.
(d) If the number of key points in the model is larger than E (E = 80% × T), consider this model to be correct and save the possible centre of the circle and its key points.
After iterating, among all correct models, choose the one whose radius r is the minimum of all the radii; save r and the 80% of its key points closest to the possible centre, and consider that these belong to the object.
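A sketch of this circle-fitting procedure in Python, keeping the stated parameters (the 1.2R rule and E = 80% × T); the number of iterations and the size of the random sample are our own assumptions:

```python
import numpy as np

def ransac_circle(points, iterations=100, sample_size=5):
    """Filter key points (T x 2 array) down to the densest circular cluster."""
    T = len(points)
    E = 0.8 * T
    best = None                                   # (centre, inlier mask, radius)
    rng = np.random.default_rng(0)
    for _ in range(iterations):
        sample = points[rng.choice(T, size=min(sample_size, T), replace=False)]
        centre = sample.mean(axis=0)              # possible centre of a circle
        R = np.linalg.norm(sample - centre, axis=1).max()
        dists = np.linalg.norm(points - centre, axis=1)
        inliers = dists < 1.2 * R                 # the 1.2R membership rule
        # A correct model has more than E key points; prefer the minimum radius
        if inliers.sum() > E and (best is None or R < best[2]):
            best = (centre, inliers, R)
    if best is None:
        return points                             # no correct model; keep everything
    centre, inliers, _ = best
    pts = points[inliers]
    order = np.argsort(np.linalg.norm(pts - centre, axis=1))
    return pts[order[: int(0.8 * len(pts))]]      # 80% closest to the centre
```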

Describing images
In our BoW model, a multi-dimensional vector is used to describe an image, with its elements grouped into two parts: one describes the image's local features and the other the spatial relationships among them.
(a) The local features are described by the numbers of times words from the vision code library appear; for example, if there are vision words $(x_0, x_1, x_2, \ldots, x_{P-2}, x_{P-1})$ in the library, the dimension of this part of the vector is P, with each dimension counting the number of times the corresponding vision code appears. (b) To describe the spatial relationships among the local features, we use the distance between each key point and the centre of the circle, and the relative angle. The centre of the key points is given by

$$(x_c, y_c) = \left( \frac{1}{m}\sum_{i=1}^{m} x_i, \; \frac{1}{m}\sum_{i=1}^{m} y_i \right) \quad (5)$$

where m indicates the number of key points after processing; this geometric centre is the centre of the circle. As shown in Fig. 3, the marks around it denote key points; for example, for the five-pointed star in the upper right corner, the corresponding distance and angle are L and θ, respectively.
The Euclidean distances $(L_1, L_2, L_3, \ldots, L_{m-1}, L_m)$ between every key point and the geometric centre $(x_c, y_c)$ are calculated. We take the median distance as the unit length L and assign all the distances to the bins 0−0.5L, 0.5L−L, L−1.5L, and 1.5L−MAX according to the ratio $L_i/L$.
The P- and Q-dimensional parts of the vector record the numbers of times each vision word and each spatial relationship appear in an image, respectively. Since the distances and angles are relative, the descriptions of the spatial relationships are invariant to translation, rotation, and zoom.
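Putting (a) and (b) together, the sketch below builds the (P + Q)-dimensional vector; the use of eight 45° angle bins, giving Q = 4 × 8 = 32, is our own assumption, as the text fixes only the four distance bins:

```python
import numpy as np

def describe_image(word_ids, points, P, angle_bins=8):
    """Build the (P + Q)-D image vector from matched words and their positions."""
    # Part (a): occurrence histogram over the P vision codes
    hist_words = np.bincount(word_ids, minlength=P).astype(float)

    # Part (b): spatial relationships relative to the geometric centre (x_c, y_c)
    centre = points.mean(axis=0)                    # eq. (5)
    d = np.linalg.norm(points - centre, axis=1)
    L = np.median(d)                                # unit length
    dist_ids = np.digitize(d / L, [0.5, 1.0, 1.5])  # bins 0-0.5L, 0.5L-L, L-1.5L, >1.5L
    ang = np.arctan2(points[:, 1] - centre[1], points[:, 0] - centre[0])
    ang_ids = ((ang + np.pi) / (2 * np.pi) * angle_bins).astype(int) % angle_bins
    hist_space = np.zeros(4 * angle_bins)
    for di, ai in zip(dist_ids, ang_ids):
        hist_space[di * angle_bins + ai] += 1       # joint distance-angle counts

    return np.concatenate([hist_words, hist_space]) # P + Q dimensions
```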

Obtaining discriminant function
During navigation, the camera continually acquires images and the robot arrives at judgments according to the discriminant function, which is trained offline and obtained as follows.
Generally speaking, there are two kinds of classifiers, depending on the degree to which a human participates during learning: supervised and unsupervised. As the SVM has attracted a great deal of attention and recently achieved good results, we use it as a supervised classifier to train our models. Its aim is to separate two kinds of patterns as far as possible, according to the principle of structural risk minimisation, by constructing a discriminant function. The training samples can be separated by a hyperplane $(w \cdot x) + b = 0$, with the two bounding linear hyperplanes at a distance D from the samples; the distance between these two hyperplanes is then $M = 2D/\|w\|$, and the SVM maximises this margin. During the course of our training, images containing objects are regarded as positive inputs and the others as negative ones. From them, we obtain the trained SVM discriminant function for general object recognition offline, which is very helpful for the robot's navigational recognition.
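A minimal training sketch with scikit-learn's linear SVM, using the labelling described above (+1 for images containing the object, −1 otherwise):

```python
import numpy as np
from sklearn.svm import SVC

def train_discriminant(pos_vectors, neg_vectors):
    """Train one offline discriminant function for a landmark category."""
    X = np.vstack([pos_vectors, neg_vectors])    # (P + Q)-D image vectors
    y = np.hstack([np.ones(len(pos_vectors)), -np.ones(len(neg_vectors))])
    clf = SVC(kernel="linear", C=1.0).fit(X, y)  # maximises the separating margin
    return clf                                   # clf.decision_function gives (w . x) + b
```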

Recognising landmarks online
The discriminant functions obtained offline, covering many different kinds of objects, are used to build a database which the robot uses to recognise landmarks. While the robot is running, its camera continually acquires image information, and every image is processed using the SIFT algorithm to obtain feature points. After computing similarities, the feature points that meet the specified threshold are saved and processed by the RANSAC algorithm to reduce background disturbance. Every image is then represented as a (P + Q)-dimensional vector and recognised by the offline discriminant functions in the database. A series of recognition results is thus obtained, which the robot uses to localise itself; the framework is shown in Fig. 4.
We can summarise our recognition algorithm in two parts, i.e. offline and online. (i) Training offline: let A = {A_1, ..., A_s} be a set of training images containing target objects, with every image marked as +1 when trained, and B = {B_1, ..., B_t} a set of training images not containing target objects, with every image marked as −1. This phase involves the following three steps: (a) a visual code library is generated from the training image set; (b) every image in A and B is represented as a multi-dimensional vector with background disturbance reduced; (c) the SVM is used to train on these vectors and finally obtain the discriminant functions.
(ii) Recognition of landmarks online: (a) an image is obtained, its background disturbance is reduced, and it is represented as a multi-dimensional vector; and (b) landmarks are recognised using the discriminant functions.
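As an end-to-end illustration of the online phase, the sketch below ties together the earlier helper sketches (match_features and describe_image); the RANSAC background-reduction step is elided here for brevity:

```python
import numpy as np
import cv2

def recognise_landmarks(frame_gray, code_library, classifiers):
    """Return the landmark categories recognised in one camera image.
    classifiers: dict mapping a category name to its trained SVM."""
    sift = cv2.SIFT_create()
    kps, descs = sift.detectAndCompute(frame_gray, None)
    if descs is None:
        return []
    feats, idx = match_features(descs, code_library)   # threshold the similarities
    if len(idx) == 0:
        return []
    # Assign each surviving feature to its nearest vision code (cosine, up to norms)
    word_ids = np.argmax(feats @ code_library.T, axis=1)
    pts = np.array([kps[i].pt for i in idx])
    vec = describe_image(word_ids, pts, P=len(code_library))
    # A positive decision value means the landmark is judged present
    return [name for name, clf in classifiers.items()
            if clf.decision_function([vec])[0] > 0]
```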

Navigation algorithm
The navigation flowchart shown in Fig. 5 is explained in detail in [5]. However, our method differs in how natural landmarks are recognised: [5] requires photos of the landmarks to be taken manually in advance and used as image-matching templates, whereas we recognise them through general object recognition.

GPU acceleration during image processing
The SIFT algorithm is used mainly to detect key points and describe each of them by a 128-dimensional vector in the course of general object recognition. These points are local extrema containing orientation information; they are detected in the different scale spaces of an image and can be described in terms of scale, size, and orientation. As this process takes a long time, it is necessary to use GPU acceleration to speed up the SIFT algorithm, which occupies most of our algorithm's running time.
The GPU has a great advantage over the central processing unit (CPU) for image processing, and NVIDIA has released an official development platform, CUDA, on which the SIFT code can be computed in parallel. We test images of different sizes with different numbers of key points on the following platform: operating system, 32-bit Windows 7; memory, 2 GB; CPU, Intel(R) Core(TM) 2 Duo E7500 @ 2.93 GHz; GPU, NVIDIA GeForce 310. As the results in Table 1 show, the acceleration is obvious when an image is large and has many key points.
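A standard GPU SIFT binding is not part of the usual OpenCV Python builds, so the sketch below shows only the CPU-side harness for reproducing such a timing comparison; the file names and sizes are illustrative:

```python
import time
import cv2

def time_sift(image_path, runs=10):
    """Average SIFT detection time and key-point count for one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    start = time.perf_counter()
    for _ in range(runs):
        kps, _ = sift.detectAndCompute(img, None)
    elapsed = (time.perf_counter() - start) / runs
    return img.shape, len(kps), elapsed

for path in ["img_640x480.jpg", "img_1024x768.jpg", "img_1600x1200.jpg"]:
    shape, n_kps, t = time_sift(path)
    print(f"{shape}: {n_kps} key points, {t * 1000:.1f} ms per detection")
```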

Experimental environment
A Pioneer 3-DX mobile robot equipped with a PTZ monocular colour camera, 16 sonar sensors, a speedometer, an electronic compass, etc., is chosen for our experiments, which are conducted in the SEU mobile robot laboratory shown in Fig. 6. The laboratory measures ∼10 m × 8 m.

Specific experiments
5.2.1 Experiment 1: As our proposed algorithm aims to recognise general objects, the main goal of this experiment is to recognise different objects of the same category. The experiment is conducted three times with five kinds of key landmarks each time, i.e. chair, guitar, wastebasket, umbrella, and fan; each landmark may be replaced between runs by a different instance of the same category. The robot always starts from the lower left-hand corner and finishes at the upper right-hand one. The hand-drawn map and the three paths the robot runs during its navigation are shown in Fig. 7.
As can be seen, even when the landmarks are replaced by other instances of the same categories, our algorithm still works well, and the robot always reaches its destination successfully.

Experiment 2:
The robustness of the navigation algorithm is tested when the landmarks are moved slightly. In the first navigation, the positions of all the landmarks and the hand-drawn route remain unchanged; in the second, the landmarks are moved 1 m to the left; and, in the third, they are moved 1 m to the right.
As can be seen in Fig. 8, the robot can still reach its intended destination even if the landmarks are moved slightly from their original positions.

Experiment 3:
The robot's navigation performance is tested by reducing the number of landmarks from five in the first run to four in the second and three in the third.
As can be seen in Fig. 9, reducing the number of landmarks has almost no effect on the path. However, if the environment is too large and the number of landmarks too small, the robot might not navigate very well according to the sketched map, due to the lack of some necessary information. In this case, the odometer plays an important role.

Experiment 4:
By placing obstacles in the path the robot might cover according to the hand-drawn map, we test its capability to avoid them by recording its reactions when facing them. As shown in Fig. 10, the robot avoids the obstacles it encounters by using its sonar sensors, in accordance with the navigation algorithm.

Experiment 5:
To further test the robot's navigational performance, in the first navigation the landmarks on the hand-drawn map correspond to those in the real environment; in the second, the fourth landmark on the map is a rubbish bin while the real object is a chair.
As can be seen in Fig. 11, during its second navigation the robot cannot find the fourth landmark, because it does not correspond to what is shown on the hand-drawn map. However, the robot still arrives at its destination based on the other landmarks, the odometer, and the map.

Conclusions and future work
Here, we proposed a novel general object recognition algorithm combining offline training and learning with online recognition, which is helpful for recognising natural landmarks and even for human-robot interaction. We successfully applied it to robot navigation based on a hand-sketched map and natural landmarks, with the experimental results demonstrating its advantages. As the world changes rapidly over time, categories of objects become increasingly complex. Since our current recognition algorithm is still limited, our next work will focus on online learning and training for recognition.