Pose-guided End-to-end Visual Navigation

End-to-end visual navigation based on deep reinforcement learning (DRL) has recently attracted much attention. In most existing navigation methods, a robot moves only along fixed directions (e.g., up, down, left and right) on a grid. Such methods are neither flexible nor efficient, which worsens navigation performance (i.e., movement distance and number of rotations). To address this problem, we propose a novel pose-guided end-to-end visual navigation framework that is both flexible and efficient. In this framework, a robot can move along arbitrary directions, which are determined by the poses between adjacent objects. Further, to select a proper motion at each step and finally form an optimal path, we propose a DRL-based action-selection strategy, which builds a dynamic action space on top of a deep siamese actor-critic network. In addition, to validate the proposed method, we introduce a novel pose-guided dataset. Experimental results demonstrate that the proposed method outperforms the state of the art in both flexibility and efficiency.


Introduction
Visual navigation plays an important role in diverse fields such as mobile robots [1], unmanned aerial vehicles (UAVs) [2] and unmanned ground vehicles (UGVs) [3]. Traditional navigation approaches follow modular pipelines [4]: first building maps [5], then planning paths [6] and finally controlling movements [7]. These map-based navigation approaches rely heavily on the quality of the maps and are time-consuming. In contrast, end-to-end visual navigation based on DRL can directly obtain control policies from input images, which is concise and efficient [8]~[10]. Thus, end-to-end visual navigation based on DRL has attracted much interest in recent years.
Although end-to-end navigation has achieved great success in unknown-environment navigation [11]~[13] and dynamic collision avoidance [14], most existing methods interact with the environment through fixed motions (e.g., up, down, left and right) and are restricted to moving on a grid [15]~[18], which degrades navigation flexibility and efficiency. Because of these problems, end-to-end navigation has not been utilized widely, which motivates us to develop a flexible and efficient end-to-end navigation method.
IWECAI 2021 · Journal of Physics: Conference Series 1873 (2021) 012011 · IOP Publishing · doi:10.1088/1742-6596/1873/1/012011
In this paper, we propose a novel pose-guided end-to-end visual navigation framework, in which a robot can move along arbitrary directions according to the poses between adjacent objects. Intuitively, the proposed method is more flexible and anthropomorphic, as shown in Fig. 1. Moreover, a DRL-based action-selection strategy is leveraged to learn the policy for selecting optimal actions; it improves on the deep siamese actor-critic network [19] by introducing a dynamic action space. To support the pose-guided navigation framework, we build a pose-guided dataset. To the best of our knowledge, this is the first pose-guided dataset. To fill this void in the community, we release the dataset at https://github.com/CASHIPS-ComputerVision/Pose-guided-dataset.
The novel contributions are summarized as follows. 1) A novel flexible pose-guided navigation framework is proposed, in which a robot can move along arbitrary directions determined by the poses between adjacent objects.
2) A DRL-based action-selection strategy is developed to select a proper motion at each step and finally form an optimal path, in which a dynamic action space is built on a deep siamese actor-critic network.
3) To validate the proposed method, a novel pose-guided dataset is introduced. Experimental results demonstrate that the proposed method outperforms the state of the art in both flexibility and efficiency.
The rest of this paper is organized as follows. Related work is introduced in Section 2. The proposed pose-guided end-to-end visual navigation method is detailed in Section 3. Experimental results are provided in Section 4, and conclusions are given in Section 5.
Fig. 1. Left: the trajectory of the target-driven model [19]; right: the trajectory (blue) of the proposed method.

Related Work
In this section, we describe the related work on visual navigation. There are two groups: traditional visual navigation and end-to-end visual navigation.

2.2.End-to-end Visual Navigation
End-to-end visual navigation bypasses the modular steps and directly learns navigation strategies with deep networks. Pomerleau [29] proposed the pioneering ALVINN system in 1989, which uses a multilayer perceptron to learn the directions a vehicle should steer. Afterwards, deep learning based end-to-end navigation achieved excellent success, owing to powerful convolutional neural networks (CNNs) and growing computational power. Brahmbhatt et al. [30] presented a city-level end-to-end navigation system based on CNNs, which uses local streetscape images for navigation. Xu et al. [31] demonstrated an end-to-end method that predicts future vehicle egomotion from a large-scale video dataset with a CNN.
Due to significant recent developments in DRL, DRL-based end-to-end navigation methods have been presented [9], [10]. Among these works, A3C [32] and PPO [33] are two widely utilized DRL algorithms, because A3C computes quickly on parallel tasks and PPO performs well on continuous control tasks. Mirowski et al. [34] presented an end-to-end navigation system by training an Asynchronous Advantage Actor-Critic (A3C) agent in a 3D maze. Luo et al. [35] developed a vision-based target-driven expert policy by training a PPO agent in the Habitat simulator. In addition to DRL algorithms, datasets play an important role in end-to-end navigation. Several datasets have been proposed to improve navigation performance, for example AI2-THOR [36], Streetlearn [37] and AdobeIndoorNav [38]. With the assistance of AI2-THOR [36], Zhu et al. [19] proposed a deep siamese actor-critic model, which improves the generalization of end-to-end navigation.
In summary, DRL-based end-to-end methods have gained much attention. Unlike most existing methods, which move on a grid, we propose a novel flexible pose-guided end-to-end visual navigation method that uses the poses between adjacent objects to guide movement in arbitrary directions.

Method
In this section, we detail the proposed pose-guided end-to-end visual navigation framework, shown in Fig. 2. First, the observation and target images are taken as input. Then, the action space is generated according to the poses between adjacent objects from the proposed pose-guided dataset. Afterwards, the DRL-based action-selection strategy selects the optimal action at each state. Finally, all selected actions form an optimal navigation path as the output.
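This pipeline can be sketched as a simple rollout loop. The `dataset` and `policy` objects and their method names below are hypothetical stand-ins for illustration, not the authors' released code:

```python
def navigate(dataset, policy, start_state, target_image, max_steps=100):
    """Roll out one episode of the pose-guided navigation pipeline."""
    state, path = start_state, []
    for _ in range(max_steps):
        observation = dataset.observation(state)   # RGB image at the current state
        actions = dataset.action_space(state)      # actions from poses between adjacent objects
        action = policy.select(observation, target_image, actions)
        path.append(action)
        state, done = dataset.step(state, action)  # move along the chosen direction
        if done:                                   # target state reached
            break
    return path
```

The loop terminates either at the target or after a step budget, and the accumulated actions form the navigation path.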

3.1.Pose-guided Dataset
The proposed dataset is a discrete representation of the environment, which selects reachable objects as discrete points. At each discrete point, RGB and depth images of the surrounding scene are captured. All RGB images together represent the overall state space of the robot, and all depth images are fed into VoteNet [39] to obtain the poses between adjacent objects. VoteNet is a state-of-the-art network for 3D object detection, which uses deep Hough voting to detect 3D objects in a point cloud. The poses are extracted from the 3D bounding boxes output by VoteNet. The process of building the pose-guided dataset is shown in Fig. 4.
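To illustrate how a pose between adjacent objects can be read off from detected boxes, the sketch below computes a heading angle and ground-plane distance between the centers of two 3D bounding boxes. The coordinate convention (y as the vertical axis) and the function name are assumptions for illustration; the paper takes the poses directly from VoteNet's output boxes.

```python
import math

def relative_pose(center_a, center_b):
    """Heading (degrees) and ground-plane distance from object A to an
    adjacent object B, given (x, y, z) bounding-box centers with y vertical."""
    dx = center_b[0] - center_a[0]
    dz = center_b[2] - center_a[2]
    theta = math.degrees(math.atan2(dz, dx))  # direction the robot should face
    d = math.hypot(dx, dz)                    # distance the robot should walk
    return theta, d
```

A pair (theta, d) of this form is exactly what an arbitrary-direction action needs.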

3.2.Action-selected Strategy
In this section, we propose a DRL-based action-selection strategy that learns a policy enabling the robot to select the optimal action at each state and generate the navigation path. The strategy operates on the proposed pose-guided dataset.
Generally, in DRL-based action-selection methods, the robot learns the optimal action policy by interacting with the environment. The optimal policy is learned by maximizing the expected sum of discounted rewards E[Σ_t γ^t r_t], where γ is the discount factor, r_t is the reward obtained by the robot for executing action a_t in state s_t, s_t is the state at time t, and a_t is the action generated by following some policy π. Inspired by the target-driven navigation model [19], we propose a novel DRL-based action-selection strategy. The original target-driven model combines a siamese network and an actor-critic network to realize end-to-end visual navigation. To improve it further, we introduce a dynamic action space and improve the reward function. The improvements cover three parts: action space, reward function and network architecture, described in detail below.
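For a finite episode, the discounted return above can be computed directly (a minimal sketch; `gamma` corresponds to γ):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over a finite episode."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```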
Action space. By setting a dynamic action space, the robot can select actions flexibly at each state. To constrain the possible states and allow off-line batch training, we treat the actions as discrete movements in the exploration space. Unlike the actions in other navigation methods, the actions in the proposed method are guided by the poses between adjacent objects. Moreover, we set two basic actions at each state: turning left and turning right by a fixed angle. When the robot executes an action (i.e., rotating θ degrees and walking d meters) to reach the next state, we rotate the robot in reverse to ensure that its observation belongs to the proposed dataset. To keep the reward dimensions consistent, we pad the action space to the maximum action-space capacity in the scene and replace the missing actions with a fixed value.
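A minimal sketch of this dynamic action space, assuming each action is a (rotation-degrees, distance-meters) pair; the padding value and the 90-degree turn angle are illustrative assumptions, not necessarily the paper's exact constants:

```python
PAD_ACTION = (0.0, 0.0)  # fixed placeholder for missing actions (an assumption)

def build_action_space(pose_actions, max_capacity, turn_deg=90.0):
    """Pose-guided actions plus the two basic turns, padded to a fixed
    size so reward dimensions stay consistent across states."""
    actions = list(pose_actions)
    actions.append((+turn_deg, 0.0))   # turn left
    actions.append((-turn_deg, 0.0))   # turn right
    while len(actions) < max_capacity:
        actions.append(PAD_ACTION)     # replace missing actions with a fixed value
    return actions
```

The fixed output size is what lets the policy head of the network (Section 3.2) keep a constant dimensionality across states with different numbers of adjacent objects.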
Reward function. To achieve low navigation cost, we adopt a straightforward reward function. When the robot reaches the target state, it obtains a reward of 10.0. We add a small time penalty (-0.01) as the instant reward for each step. To prevent the robot from choosing an invalid action, we set the reward for executing an invalid action to -100. Furthermore, we add a step that identifies whether an action is valid during navigation.
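This reward design is simple enough to state directly in code (the constants are the paper's; the function signature is an illustrative assumption):

```python
GOAL_REWARD = 10.0        # reward for reaching the target state
STEP_PENALTY = -0.01      # small time penalty as the instant reward
INVALID_PENALTY = -100.0  # discourages selecting invalid (e.g., padded) actions

def reward(reached_target, action_valid):
    if not action_valid:
        return INVALID_PENALTY
    return GOAL_REWARD if reached_target else STEP_PENALTY
```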
Network architecture. The architecture of the proposed DRL-based action-selection network is shown in Fig. 5. The network takes the observation and target images as inputs. The generic siamese network consists of two ResNet-50 networks and a fully connected layer, with the ResNet-50 and fully connected layers sharing parameters between the two branches. Finally, a dynamic policy and a value estimate are output through two FC layers.
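To make the data flow concrete, the sketch below mimics the siamese actor-critic forward pass with NumPy, replacing each ResNet-50 branch with a random linear embedding. The feature and action-space sizes are assumptions; this is a shape-level illustration, not the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared branch: stand-in for ResNet-50 features followed by one shared FC layer.
W_embed = rng.standard_normal((2048, 512)) * 0.01

def embed(features):
    return np.maximum(features @ W_embed, 0.0)  # ReLU; weights shared by both branches

# Output heads on the fused embedding: a policy over the fixed-size
# (padded) action space and a scalar value estimate.
MAX_ACTIONS = 8  # assumed maximum action-space capacity
W_policy = rng.standard_normal((1024, MAX_ACTIONS)) * 0.01
W_value = rng.standard_normal((1024, 1)) * 0.01

def forward(obs_features, target_features):
    fused = np.concatenate([embed(obs_features), embed(target_features)])
    logits = fused @ W_policy
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                # softmax over the dynamic policy
    value = float(fused @ W_value)      # critic head
    return probs, value
```

The key structural points survive the simplification: both image branches share one set of embedding weights, and the fused embedding feeds two separate FC heads.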

Experiment
In this section, we compare our method with state-of-the-art methods: A3C [32], PPO [33] and the target-driven navigation method [19]. Since existing public datasets do not contain poses between adjacent objects, we perform experiments on the proposed pose-guided dataset, as shown in Fig. 4. The dataset contains four scenes: a living room, a laboratory, an apartment and a break room. For [32], [33] and [19], visual navigation is implemented on a grid; thus, we construct a grid in the same scenes as the proposed pose-guided dataset and train them on it.
We train the proposed method on an Nvidia GeForce GTX 1050 GPU. The training proceeds as follows. For the proposed network, we use the Adam optimizer with a learning rate of 0.001 and train for 10,000 epochs. An episode terminates only when the step count reaches 10,000, and we train the robot for 10,000,000 steps in total. During training, we select five destinations as the training set for each image set. In testing, we randomly select 100 starting states to reach a given destination and output the navigation paths.
First, we evaluate efficiency and flexibility. Efficiency is validated quantitatively by comparing the navigation cost, including the number of rotations and the movement distance. Flexibility is analysed qualitatively by comparing the navigation paths. Finally, to demonstrate the comprehensive navigation performance, we compare the Success Rate (SR) and the Success weighted by Path Length (SPL).

4.1.Efficiency
We train and test the proposed method, [32], [33] and [19] on the proposed dataset, and compare the average navigation costs over 100 episodes, including movement distance and number of rotations. As shown in Fig. 6, the navigation cost of our method is smaller than that of the other three methods in both rotations and moving distance. Specifically, in the apartment, our method requires only half the rotations and less than one third of the movement distance of the other methods. The laboratory is more complex than the other three scenes; although our method performs more rotations there, its movement distance is far less than that of the other methods. In summary, the proposed pose-guided end-to-end navigation method has lower navigation cost, because its movements are determined by the poses between adjacent objects rather than by a grid.
Fig. 6. Navigation cost of different methods: (a) average distance; (b) average rotation times.

4.2.Flexibility
We demonstrate flexibility qualitatively by comparing the trajectories of different methods, shown in Fig. 7. The visualization directly shows the moving trajectories of the robot for each method with the same start and end points. Unlike the other methods, whose trajectories are constrained to the grid, the trajectories of our method are guided by poses; therefore, the robot has flexible action choices at each state.
Fig. 7. Navigation trajectories of our strategy (blue) against A3C [32] (yellow), PPO [33] (green) and the target-driven model [19] (red) in four indoor scenes.

4.3.Navigation Performance
To further validate the comprehensive navigation performance, we evaluate the Success Rate (SR) and the Success weighted by Path Length (SPL) [40]:
SPL = (1/N) Σ_{i=1}^{N} S_i · l_i / max(p_i, l_i)    (2)
where N denotes the number of testing episodes, S_i is a binary indicator of success in testing episode i, p_i represents the path length and l_i is the shortest path distance from the start to the goal. Table I summarizes the results for the four methods in the four testing scenes. The proposed navigation strategy outperforms the other methods in both SR and SPL. Our method achieves a 100% success rate, and its SPL is above 0.1 in all scenes. In the apartment, a simple scene, the SPL of our method exceeds 0.3, better than that of the other methods. In the laboratory, a complex scene, the SPL is only around 0.1, yet still nearly twice that of the other methods. The reason is that the complexity of the environment affects navigation performance.
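The two metrics can be computed from per-episode records as follows (a direct transcription of the definitions; the function names are illustrative):

```python
def success_rate(successes):
    """Fraction of test episodes that reach the goal."""
    return sum(successes) / len(successes)

def spl(successes, path_lengths, shortest_lengths):
    """(1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    n = len(successes)
    return sum(s * l / max(p, l)
               for s, p, l in zip(successes, path_lengths, shortest_lengths)) / n
```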
In summary, compared to the state-of-the-art methods, the proposed pose-guided navigation method is more flexible and efficient. Moreover, our method has better navigation performance in both SR and SPL.

Conclusion
We present a novel pose-guided end-to-end navigation method. Its main advantage is that the robot can move along arbitrary directions determined by the poses between adjacent objects. Moreover, we build a novel pose-guided dataset to validate our method and propose a DRL-based action-selection strategy to select actions. The results demonstrate that the proposed method outperforms state-of-the-art methods in flexibility and efficiency.