Role Playing Learning for Socially Concomitant Mobile Robot Navigation

In this paper, we present the Role Playing Learning (RPL) scheme for a mobile robot to navigate socially with its human companion in populated environments. Neural networks (NN) are constructed to parameterize a stochastic policy that directly maps sensory data collected by the robot to its velocity outputs, while respecting a set of social norms. An efficient simulative learning environment is built with maps and pedestrians trajectories collected from a number of real-world crowd data sets. In each learning iteration, a robot equipped with the NN policy is created virtually in the learning environment to play itself as a companied pedestrian and navigate towards a goal in a socially concomitant manner. Thus, we call this process Role Playing Learning, which is formulated under a reinforcement learning (RL) framework. The NN policy is optimized end-to-end using Trust Region Policy Optimization (TRPO), with consideration of the imperfectness of robot's sensor measurements. Simulative and experimental results are provided to demonstrate the efficacy and superiority of our method.


I. INTRODUCTION
The capability to navigate in densely populated and dynamic environments is one of the most important features that enable the deployment of mobile robots in unstructured environment, such as schools, shopping malls and transportation hubs. The key difference between the problem of navigating among humans and the traditional path planning and obstacle avoidance problems is that humans tend to smoothly evade each other interactively and cooperatively, rather than remaining static or maintaining an indifferent trajectory dynamics. In other words, there are social norms that need to be understood and complied to achieve maximum comfort of all involved pedestrians during navigation. We refer to this as the problem of social navigation, which aims to model such social norms and develop a robotic navigation policy that is socially acceptable to the pedestrians around.
For social navigation, the traditional approaches based on Dynamic Window Approach (DWA) [1] or potential fields [2], [3] are usually of limited efficacy as pedestrians are simply regarded as uncooperative obstacles. An illustrative example is the freezing robot problem (FRP) [4], [5], where a mobile robot will be stuck in a narrow corridor when facing a crowd of people if it lacks the ability to predict the joint collision avoidance behaviors of human pedestrians. To this M. Li, R. Jiang, S. S. Ge  end, researches have been done to understand the principles of humans' joint collision avoidance strategies and one of the pioneering works are the social force model (SFM) [6], [7]. Other joint collision avoidance model such as reciprocal velocity obstacles (RVO) have been proposed in [8], [9], [10], with an underlying assumption that all involved agents adopt the same collision avoidance strategies. These ideas are also applied to visual tracking of pedestrians [11], [12]. More recently, several attempts are made to learn probabilistic models of pedestrians' trajectories during joint collision avoidance, based on which the robot's navigation decision is generated such that it is able to behave naturally and correctly in similar situations [13], [5], [14], [15].
In this paper, we propose to augment the dimensions of human-robot interaction in social navigation by further endowing robot with appropriate group behaviors when it is travelling with a human companion. This capability is highly desirable for assistive mobile robots [16], [17], [18], which serve as assistants and companions and are expected to travel along with theirx human partners in not only home environment but also possibly crowded public areas. In other words, apart from understanding the collision avoidance behaviors of pedestrians, the robot also needs to consider the motion of its companion so as to maintain a sense of affinity when they are travelling together towards a certain goal. We call this socially concomitant navigation (SCN) and it is more challenging than the aforementioned social navigation problem, where the robot is assumed to travel alone with a simpler pursuit of reaching a specific goal while being free of collision.
To address the problem of SCN, we develop a new learning scheme called Role Playing Learning (RPL). Particularly, we formulate such problem under the framework of Partially Observable Markov Decision Process (POMDP) and reinforcement learning (RL). A neural network (NN) is used to parameterize the navigation policy of the robot, which is optimized to gives proper steering commands for the next time instance based on the robot's current and previous observations to its surroundings. To facilitate the RL process, we create a simulative navigation environment by mirroring a collections of real world pedestrians data sets and develop an on-policy optimization method called Partially Observable Trust Region Policy Optimization (PO-TRPO). In each run in an optimization iteration, the robot will attempt to play itself as a companion of a randomly chosen pedestrian by executing the NN navigation policy. The NN policy is then optimized using PO-TRPO based on a batch of collected trajectories. Compared to the existing analytically derived or data-driven approaches, our RPL scheme has the following advantages: 1) RPL scheme is less restrictive. It does not rely on the assumption that the robot and other agents (pedestrians) share the same decision-making models [8], [9], [10], [5], [14] or that the navigation goals of pedestrians are known [5], [14].
2) The formulation of RPL scheme is more generalizable and flexible. Our formulation contain no manuallydefined feature and domain knowledge (e.g., statistics of pedestrians' behaviors). It is not hardware-specific and can be easily modified to incorporate kinematics of different mobile robot platforms, sensor specifications and navigation objectives. In addition, unlike [15], [14], the learned navigation policy operates without assess to the global map of the environment. Therefore, it is not environment-specific and is well generalizable to unmet real-world scenarios. 3) We explicitly consider the noise and limitation of the robot's sensor measurements. Most approaches for social navigation assume that the robot has full and accurate knowledge of interested variables, such as positions or distance of pedestrians and obstacles [8], [9], [10], [14]. On the contrary, our RPL schemes is rooted from the situation where the robot can only perceive those lie within its sensor's Field of View (FoV), with the existence of measurement noise. 4) As a RL-based approach, RPL is efficient. Although RPL aims at solving tasks that involve interaction among robot, humans and physical environment, it does not require participation of human in both data collection and learning, which is known to be tedious and timeconsuming. Instead, the learning process is safely automated in a simulative yet realistic environment with no human intervention. We evaluate the performance of our approach in both simulations and real-world experiments, by comparing it with a baseline planner based on RVO [8] and humans, repectively. We also show that, with some tricks, the learned navigation policy can still be effective when the navigation scenario is reduced to the aforementioned social navigation, which means the robot is travelling without human companion.
The remainder of this paper is organized as follows. Related work is first reviewed in Section II. In Section III, the problem of SCN is formulated as a POMDP and associated definitions are given. RPL scheme and PO-TRPO algorithm are described in Section IV. Sections V and VI provide extensive results of simulation and experiment, followed by some concluding remarks in Section VII.

II. RELATED WORK
The problems of robot navigation in populated and dynamic environment can be addressed from a number of angles, which can be largely classified into two groups as in the following subsections.

A. Interactive Behaviors Models
Many researches have been proposed to describe the interactive navigation behaviors of humans by fitting a computational model to the observed pedestrians trajectories [19]. In this way, the robot's path planner is able to understand pedestrians' intention during joint collision avoidance and actively calculate an optimal route towards its goal.
In the field of robotics, a majority of work in this direction is done via inverse reinforcement learning (IRL) [20], which learns a cost function that explains the observed behaviors. For example, maximum entropy IRL [21] is adopted in a number works [22], [23], [24], [25], [26] for discrete human behavior prediction and route planning. However, discrete representation is less desirable when modeling trajectories, which are in nature continuous and has higher order dynamics, such as velocities and acceleration. Instead, [15] adopts Maximum-A-Posteriori Bayesian IRL [27] to learn appropriate navigation behavior of a specific mobile robot from a set of demonstration trajectories. Note that, the demonstration data in [15] is specific to configurations of the robot and its sensor and has to be collected via human operation, which could be time-consuming. On the other hand, [13], [14] learns probabilistic models of composite trajectories of pedestrians from video data by maximum entropy learning and IRL. To better capture the characteristics of observed trajectories, they propose to develop their models based on a set of features that are hand-crafted according to the domain knowledge from psychological studies. In addition, those features contain velocities and accelerations of pedestrians, which, in practice, are hard to precisely measure. Besides, interacting Gaussian process (IGP) is derived in [5] to model the joint trajectories of pedestrian while explicitly considering the effects of observation noise. Nevertheless, the design of IGP also requires several hand-crafted kernels that are formulated based on the priori information in a specific application scenario.
Other than researchers in robotics, the community of computer vision also possess great interest in pedestrian modeling. One of the important topics is trajectory prediction in video space. In [11], Linear Trajectory Avoidance (LTA) is developed as a dynamic model for pedestrians in video space for shortterm trajectory prediction and it is integrated into visual tracking system. Gaussian process is adopted in [28] to learn the motion pattern of pedestrians. Recently, Social LSTM is proposed in [29] for human trajectory prediction in crowd space. Similarly, the feature of social sensitivity is developed in [30] to analyze trajectories of pedestrians and bicyclists. While the above methods can effectively predict the navigation intention of pedestrians in videos, it is still unclear how to apply these model to navigation of robot in real scenarios.

B. Steering Models
In contrast to learning behavior models of pedestrians, a more direct perspective is to develop a steering model that outputs the immediate navigation actions given the robot's current observation to the environment. One of the pioneering work in this direction is the social force model (SFM) [6], which uses energy/potential functions to encode the social status of pedestrian. Then, the navigation motivation of a pedestrian can be derived by taking the gradients of these energy functions. Following this idea, subsequent work [31], [32], [33], [12] propose to infer the optimal parameters of the energy function by fitting them to video data. However, they are likely to produce suboptimal results if the demonstration data from humans are imperfect. In [34], the authors integrate a people tracker and an iterative A * planner, with which the robot actively follows the pedestrian travelling in a similar direction to navigate through crowded environment. [35] follow the same idea and formulate the choice of a pedestrian to follow as a Multi-Policy Decision Making process. On the other hands, [36] develops a hierarchical POMDP for predictive navigation in dynamic environment. The idea is to predict the motion of pedestrians and generate a environment-specific cost map for path planning and obstacle avoidance.
Other than navigating in a pedestrian-aware manner, several reactive collision avoidance techniques have also been developed, such as DWA [1], [37], velocity obstacles [38] and reciprocal velocity obstacles (RVO) [8], [9], [10]. The common idea of these methods is to treat pedestrians as moving obstacles and reactively update the planner every short periods to achieve collision avoidance. As mentioned in Section I, these methods are less effective for social navigation as they lack predictive abilities and are based on some restrictive assumptions, such as accurate knowledge of moving agents' velocities [37] and that all agents adopt the identical collision avoidance strategy [8], [9], [10].
Our proposed navigation policy belongs to the steering models. It takes an observation vector as input and outputs the navigation action through a stochastic neural networks. During RPL, our policy is optimized by the PO-TRPO algorithm, which is derived based on the recent advances in deep reinforcement learning (DRL) [39], [40]. DRL exploits the massive representation power of deep neural networks (DNN) [41] to build a complex yet sophisticated decision model, with which an agent can directly learn from raw signals instead of carefully crafted feature and tends to act more intelligently. Recently, there are several attempts in using DNN and DRL for robot navigation. For example, an end-to-end motion planner is learned in [42] to map raw sensor data of a laser range finder onto steering commands of a mobile robot. In [43], a decentralized multi-agent collision avoidance policy is learned via DRL, which can be thought as a DRL version of the original RVO approach [8]. Finally, a target-driven visual navigation policy for home environment is learned in [44] via DRL. They create a set 3D virtual home environments for effective and efficient training of the agent, which shares a similar idea with our proposed RPL scheme.

III. PROBLEM FORMULATION
To formulate the problem of socially concomitant navigation, we gives the following rules of SCN: 1) The robot should reach its goal as fast as possible; 2) The robot should not collide with any of the pedestrians or its companion, or run into any obstacle; 3) The robot should not run too far away from its companion. The above rules serve as a generic description of the robot's desired performance during navigation. To give concrete definitions, consider the navigation process as an infinite-horizon discounted POMDP in discrete time, defined by the tuple (S, A, F, O, p 0 , r, γ). S is a finite set of states s reflecting the navigation status of the robot. A is a finite set of actions a. In this paper, it is defined as a twosome of the translational and rotational velocities of a synchro-drive mobile robot, i.e., a = [v T , v R ]. F : S × A → S is state-transition mapping, which is characterized by the dynamics of the robot, the other humans and the environment. Without loss of generality, we assume deterministic state transition, i.e., s i+1 = F (s i , a i ), where s i , a i are the state and action taken at time t i . O is the set of the robot's observation o to the state s and β(o|s) denotes the conditional observation probability distribution. Note that, in practice, the robot's observation has only incomplete access to s or is subject to certain measurement noise, which implies o = s. p 0 : S → R is the initial state distribution, i.e., s 0 ∼ p 0 . r : S → R is a scalar reward given to the robot and γ ∈ (0, 1] is the reward discount factor.
Robot motion dynamics: In this paper, synchro-drive mobile robots are considered, whose motion equation can be approximated by assuming the robot's velocities to be constant within a certain short time period Otherwise, when v R (t i ) = 0, With the above formulations, our goal is optimizing a stochastic navigation policy P θ : O × A → [0, 1] with parameters θ in order to maximize the expected discounted reward: where τ = (s 0 , o 0 , a 0 , s 1 , o 1 , a 1 , · · · ) denotes the whole trajectory and a i ∼ P θ (a i |o i ). The specific definitions of the above ingredients for SCN will be elaborated as follows: State: Given ρ r and φ r , define the distance d and direction φ of a point ρ = [x, y] to the robot as follows: Then, the robot's distance to the goal located at ρ g = [x g , y g ] are computed as d g = d(ρ g ) and φ g = φ(ρ g ) denotes the offset angle between the robot's current heading φ r and its goal. Similarly, we can define the twosomes (d j ped , φ j ped ), (d j com , φ j com ) or (d j obs , φ j obs ) to describe the relative position of a pedestrians ρ j ped , a companion ρ j com or an obstacle ρ j obs to the robot. With such definitions, the state s is defined to incorporate the information related to the robot's navigation status as follows: where a is the current action vector and The vector p ped includes the distances and directions of n ped closest pedestrians while p com includes those of the robot's companion.
The vector p obs is a compact description to the robot's perception of the surrounding environment. Particularly, the boundaries of the occupied space (obstacles) in the environment are represented as a finite point set Z = {ρ 1 obs , ρ 2 obs , · · · , ρ j obs , · · · }. Then, the 9 variables in p obs are defined based on the following assumption Assumption 1: An obstacle ρ obs ∈ Z has no effect on the robot's navigation decision if it satisfies d(ρ obs ) >d obs , wherē d obs is a predefined finite constant.
By Assumption 1, it is sufficient to consider only obstacles in Z that are closed enough to the robot, whose distances are less thand obs . In practice, this limit may correspond to the robot's perception range. Let The components in vector p obs are described as follows: The distance to the nearest obstacle located at heading of the robot, i.e., where ǫ ρ is a small constant. For d L obs , φ L obs , d R obs , φ R obs , they represent the distance and direction of the closest and farthest obstacles on the robot's left (ρ L obs ) and right side (ρ R obs ), respectively, which are defined mathematically as follows: Then, the variables in p obs can be simply determined as the distance and directions of the above points according to Eqs. (6) and (7). Figure. 1 provides a comprehensive illustration of the state variables p ped , p com and p obs .
Observation: As discussed in the previous sections, sensors mounted on the robot are always subject to various kinds of limitation and measurement noise, which must be taken into  8). The blue, yellow and green circles represent the robot, its companion (Com.) and the pedestrians (Ped.) respectively. The red dashed circle with a radiid obs represents the boundary of the set Z in Eq. 12. The black arrow shows the current heading of the robot. Considering the robot's current position as the origin, the polar coordinates of the pedestrians, the companion, the closest( and the farthest) obstacles in each direction are compactly represented as vectors p ped , p com , p obs . account in order to develop a robust and practical navigation system. To this end, we define o as the robot's observation to the true state s as follows: By Eq. (18), we assume that the robot has accurate information about the goal position and its current taken action (i.e., the velocity commands output to the robot's motor) while its observations to p ped , p com , p obs may be imperfect. Particularly, consider the Field of Views (FoVs) for the robot's pedestrian and obstacle detectors illustrated as Fig. 2. Mathematically, let finite point sets F ped and F obs denote the current FoVs of pedestrian and obstacle detectors, characterized by threesomes (φ + ped , φ − ped , d + ped ) and (φ + obs , φ − obs , d + obs ), respectively. The robot's observations to the pedestrians' relative positions are obtained aŝ for j = 1, · · · , n ped , withd ped being the measurement noise/error. Fig. 2: Field of Views of the pedestrians (green) and obstacles (blue) detectors. The arrow (φ = 0) points towards the current heading of the robot. The constants φ + ped , φ + obs and φ − ped , φ − obs denote the maximum and minimum offset angles in the corresponding FoVs. Finally, d + ped and d + obs represent the maximum detection ranges for the pedestrian and obstacle detectors, respectively. The values of these constants should be determined according the specific configurations of the robot's sensor and the corresponding detection algorithms. Any pedestrian/obstacle outside the FoVs is not observable and therefore will be omitted.
Similarly, definê Then, their distance and directions to the robot are calculated using Eqs. (6) and (7). For observation to the robot's companions, we rely on the following assumptions. Assumption 2: The companions ρ 1 com , · · · , ρ ncom com are always observable to the robot.
Then,p com = [d com , φ com ], wherê Remark 1: By Eqs. (20) and (23) to (28), it is implied that the observation/measurement noisesd ped ,d obs andd com are additive and independent in different observations. A typical example of such noise is the Additive Gaussian White Noise (AGWN).
Remark 3: The mathematical definitions of the variables in observationsp ped ,p com ,p obs are given for better understanding and are required only in the simulative RPL process. In practice, it is clear that these values can be directly measured via the robot's onboard sensors without accessing the actual 2-D Cartesian coordinates [x, y] of the considered point sets (e.g., Z, F ped and F obs ). For example, consider a robot equipped with a laser range finder. These distances and offset angles can be easily obtained from the returned ranges array [50].
Reward function: A scalar reward will be given to the robot as an award of reaching the goal or a penalty of colliding with obstacles/pedestrians/companions or losing its companions. Particularly, at time t i , the process of SCN will be terminated if any of the following three termination conditions is true.
1) Goal Reaching Condition 2) Collision Conditions 3) Stray Condition Based on the above three terminal conditions, a reward r will be given to the robot as follows: Clearly, a positive reward will be given to the robot if it reaches its goal and it will receive a large negative reward if it collides with anything or be stray from its companion. Otherwise, the robot will receive an intermediate reward −10|v R |, which penalizes the robot for its rotational velocity to encourage a smoother trajectory with less turning behaviors.

IV. ROLE PLAYING LEARNING
In this section, we described the RPL scheme to learn an effective navigation policy P θ (a|o) for SCN in an efficient data-driven manner. The core idea is to transform the crowd trajectories data collected from real-world into a simulative and dynamic navigation environment, where the robot can play itself as a virtual pedestrian and iteratively improve the performance of P θ (a|o) via Partially Observable Trusted Region Policy Optimization (PO-TRPO).
Consider a set of simulative navigation environment E = {E 1 , · · · , E j , · · · }. Each environment E j = (T j , M j ) contains a set of pedestrian trajectories T j = {ρ k 0:T k } and a binary map M j that annotates the 2-D Cartesian coordinates of obstacles/occupied space in the environment. With E, the abstract process of RPL is described by the following pseudo codes in Algorithm 1.

Algorithm 1 Role Playing Learning
Initialize navigation policy P θ for Iter = 0, 1, · · · , MaxIter do while Number of collected sample time steps ≤ Batch size do Randomly choose an environment E j from E and then a trajectory ρ k 0:T k from T j . Initialize the robot's position at ρ k 0 and initial velocities Choose the robot's heading such that φ g = 0.
Choose SCN Mode with probability 0.5 if SCN Mode then Assign ρ k T0:T k as the trajectory of the robot's companion, where T 0 = arg min Create a synthesized companion that moves along the robot end if Assign all other trajectories in T j as pedestrians.
while None of the termination conditions in (29)

to (33) is satisfied do
Update the states and observations of the robot according to Eqs. (8) and (18). Let the robot execute its policy P θ . Update the robot's position according to dynamics (1) to (4) Calculate the current reward from Eq. (34) Update the positions of the companion and pedestrians according to the trajectories in T j . end while end while Update P θ using PO-TRPO. end for Companion Synthesization in non-SCN mode: As described in Algorithm 1, RPL actually incorporates two different navigation scenarios: the SCN proposed in this paper and the traditional social navigation scenario, where the robot has no human companion. This helps develop a navigation policy adaptable to both situations, with no restrictive assumption on the existence of companion. Particularly, the companion position vector p com and its observationp com are synthesized, with d com =d com = 0.8 for every time step while φ com = φ g . It is clear that the synthesized p com is equivalent to the situation where the companion is travelling non-distractively along the robot with a constant distance and guarantee that termination conditions (31) and (33) are always false.
On the other hand, in SCN mode, the companion is assigned with a truncated trajectory ρ k T0:T k such that the initial robotcompanion distance is sufficiently large.
In this paper, we construct a deep policy neural network to parameterize the navigation policy P θ , whose structure is shown in Fig. 3. The policy network P θ is to be trained with the Trust Region Policy Optimization (TRPO) [40] method. Fig. 3: Structure of the deep policy network P θ . At time t i , the observation vector o i is input to the feature network, which is a feedforward multi-layer perceptron (MLP). The output of the feature network is then fed to a LSTM network [51], a recurrent network for aggregation of the information collected through the navigation process. The LSTM network's outputs are assigned as the mean vector µ ∈ R 2 of the diagonal Gaussian unit N (µ, Σ) on the right. The covariance matrix Σ = σ 2 I ∈ R 2×2 , however, is independent of o i amd it is designed to be gradually decreasing during training and fixed during tests and experiments. Finally, the actions a i = [v T , V R ] are drawn according to N (µ, Σ).
However, the original TRPO method is derived based on fully observable MDP, which can not be directly applied to our problem due to the imperfect observation in our formulation and practice. Thus, we proposed to extend the original TRPO algorithm as PO-TRPO, which will be described in the following subsections.

A. Trusted Region Policy Optimization
The TRPO [40] algorithm is an effective on-policy optimization method for large nonlinear policies and tends to give monotonic improvement during the iterative optimization process. To be specific, a fully observable MDP is considered by TRPO and therefore the policy to be optimized is formulated as P * ξ (a|s), where ξ is the parameter vector of the policy P * . Note that, P * ξ (a|s) determines the action a directly from the true state s, which differs from our observation-based policy P θ (a|o). Let us consider the following standard definitions of the state-action value function Q ξ (s i , a i ), the value function V ξ (s i ) and the advantage function A ξ (s i , a i ): where In addition, define ν ξ as the discounted visitation frequencies ν ξ (s) = p(s 0 = s) + γp(s 1 = s) + γ 2 p(s 2 = s) + · · · (39) where s 0 ∼ p 0 , a i and s i≥1 are generated according to P * ξ and F . Let ξ − denote the old parameters in last iteration. TRPO proposes to optimize the parameters ξ iteratively regarding the following objective function: subject to E s∼ν ξ − [D KL (P * ξ − (·|s) P * ξ (·|s))] ≤ ǫ (41) where q * (a|s) is the importance sampling distribution and D KL (P * ξ − P * ξ ) is the Kullback-Leibler divergence between the old and current policies.

B. Partially Observable TRPO
As mentioned, our navigation problem is considered as a POMDP. The policy P θ (a i |o i ) depends on the observation o i instead of the true state. Therefore, we write the objective function (40) and the constraint (41) as For PO-TRPO, samples are collected by executing the old policy P θ − (a|o) to generate a set of trajectories, such as where i = 0, · · · , T − 1.
Next, for a trajectory s 0:T , we use the generalized advantage estimation (GAE) [39] to construct an empirical estimationÂ of the advantage function A θ − (a i |s i ) as the following: andV ζ (s i ) is the estimation of the value function (36) with parameters ζ (and ζ − being the old parameters). By collecting a set of K trajectories {s k 0:T k , o k 0:T k , a k 0:T k } K k=1 ,V ζ is obtained by solving the following constrained regression problem [39]: Finally, as the conditional observation probability distribution β(o|s) is independent of parameters θ and time, we obtain an estimation of the objective function (42) and the constraints (43) by replacing the expectations with sample averages as: wherē which has the same form as the one obtained in [39], except that the policy P θ (a|o) is conditioned on observation o instead.
Finally, the constrained optimization problem described in (49) and (50) is solved by conjugate gradient algorithm [52] algorithm. To summarize, the pseudo code for PO-TRPO update in Algorithm 1 is given as below: Algorithm 2 PO-TRPO Compute the estimated advantagesÂ i for all time steps using GAE with the estimated value functionV ζ . Update θ with objective function (49) and constraints (50) Update ζ with objective function (47) and constraints (48) V. SIMULATION As a data-driven approach, our deep neural network policy requires a massive amount of data to learn the socially concomitant navigation behavior. In this section, we describe how to construct a simulative environment according to the proposed RPL scheme. Particularly, the environments, the deep neural network policy and the PO-TRPO algorithm (Algorithm 2) are developed under the framework of RLLAB [53]. We make use of trajectories of interacting pedestrians collected from five different data sets, which includes the ETH and Hotel video clips from the ETH Walking Pedestrians (EWAP) [11], the motion capture (MC) data set from [14], as well as the Zara and UCY video clips from [32]. Note that, the Zara and UCY data sets have multiple subsets: Zara01, Zara02, Zara03, UCY01 and UCY03. Thus, there are totally 8 different RPL environments, i.e., E = {E 1 , · · · , E 8 }. The details of these 8 environments are summarized in Tab. I. Each trajectory in these environment provides the ID and a sequences of 2-D Cartesian positions of a pedestrian with a sampling period ∆t = 0.1 second. In addition, eight binary grid maps M 1 , · · · , M 8 representing the occupied space/static obstacles are given. However, these maps are kept unknown to the robot throughout training and evaluation. They are only used to simulate the robot's perception to the environment as the state p obs and observationp obs . Without loss of generality, we use the ETH data set as the evaluation environment and all other data sets in the Tab. I as training environments. In other words, the learned policy's performance will be assessed in an RPL environment that is excluded during training, which reflects whether it can properly generalize to uncovered situations.
As some of the trajectories in these environments are of people who were wandering or remained approximately stationary, they are excluded from the candidates of the robot's companion but will still be considered as pedestrians when the robot is navigating in the same environment.
We use a feed-forward neural network with 2 hidden layers as the feature network in our NN policy, containing 256 and 64 Tanh units, respectively. Its output is then fed to a LSTM network with 64 units. The variance of of the Gaussian output unit σ is chosen to be linearly decaying from 0.5 to 0.05 in 100 training iterations, which effectively encourages exploration during the early stage of learning and ensures convergence of the navigation policy. For GAE, a 3 layer feed-forward network with 256,64,16 Tanh units are used, with γ = 0.995 and λ = 0.96. The update step size for policy network is adaptively chosen as ǫ = 0.01/σ. For GAE update, a fixed step size ǫ 1 = 0.1 is used. The update batch size (Batch size in Algorithm 1) is 50000.
In RPL, we consider at most 3 pedestrians (i.e., n p = 3). Thus, the state p ped and observationp ped will only describe the 3 closest pedestrians and omit the others. For the situation where less that 3 pedestrians are perceived, dummy static pedestrians will be created in the remote corner of the environment so as to maintain the dimensions of p ped andp ped .
Considering a Kobuki Turtlebot 2 with a Hokuyo URG-04LX laser range finder [50] mounted on its top, we specify the sensor limitation of the robot in simulation as follows: The measurement noisesd ped ,d com andd obs are modeled by zero-mean Gaussian N (0, σ 2 ped ), N (0, σ 2 com ) and N (0, σ 2 obs ) with their variances specified as follows: Finally, the maximum translational and rotational velocities are assigned as 0.7m/s and π 3 rad/s, i.e., 0 ≤ v T ≤ 0.7 and |v R | ≤ π 3 and ∆t = 0.1. An example of our RPL environment constructed from the ETH data set is illustrated in Fig. 4 (a) Real-world Environment

A. Results
We trained our deep policy network for 1200 iterations with the data from RPL environments except for the held-out ETH environment. The curve of average discounted return obtained from each batch of trajectories is visualized in Fig. 5

Fig. 5: Average discounted return as RPL progresses
We compare the performance of our policy with a planner based on RVO [8], where the robot, its companion and the surrounding pedestrians are treated agents. In every time steps, the positions and velocities of all agents are given to the planner. Note that, for fair comparison, the agents' positions are subject to noise described in (54) and (55). For observations to obstacles, we assume the planner has full and perfect knowledge as required in the original RVO algorithm. With this protocol, we update the robot's position according to the planner's output and update the positions of the other agents according to their own trajectories in the RPL environments. The same termination conditions in Section III are applied to the robot directed by the RVO-based planner to determine whether the robot has conducted an successful navigation. For both of our policy and the RVO-based planner, we conduct 300 trials in the evaluation environment and compute the rates (in percentages) of different terminal conditions (RG: the robot reaches the goal successfully; HC/HP/HO: the robot hits a companion/pedestrian/obstacle; and LC: the robot loses its companion). The performance statistics of our policy and the RVO-based planner in SCN scenarios are listed in Tab. II. It can bee seen from Tab. II that our policy performs much better than the RVO-based planner in SCN. The RVO-based planner has a much lower success rate (29.7%) while its rate of LC is 47%, suggesting that it frequently losses its companion in SCN. Clearly, this is due to the fact that RVO is in nature a collision avoidance algorithm. Thus, it simply takes the robot's companion as another normal agent and the robot tends to stay far behind its companion to avoid collision instead of actively following it. On the contrary, our policy achieves a much higher success rate (77%). This indicates that it learns to effectively balance the objectives of SCN so that the robot is able to reach the prescribed goal while maintaining its distance to its companion and avoiding collision with other agents in the environment.
In addition to SCN, the scenarios without companion are also tested, which, as analyzed in the previous sections, reduces to the traditional social navigation scenarios. The comparative results are shown in Tab. III. For situations without companion, our policy still outperforms the RVO-based planner with higher success rate (84% to 80%) and lower HP rate (13.7% to 18%).
Finally, it is worth noting that the RVO-based planner requires velocities of the companion/pedestrians and an accurate global map of the static obstacles. Conversely, our policy depends only on position measurements that are directly accessible from the robot's onboard sensors, which is therefore much simpler and more practical.

VI. EXPERIMENTS
In experiments, we assess the performance of our developed navigation policy by comparing it with humans in the same scenarios. Particularly, a robot and a human are to repeat each specific navigation scenario for 10 times, respectively. Then, the the following two metrics are calculated: 1) Average minimum distance to the pedestrians (D ped ): the average of the minimum distance between the robot/compared human to other pedestrians throughout a trajectory . 2) Average maximum distance to the companion (D com ): the average of the maximum distance between the robot/compared human to its/his companion throughout a trajectory. We use the same mobile platform (a synchron-drive Turtlebot 2 with a Kobuki base) and the same laser range finder (Hokuyo URG-04LX) simulated in last section. For pedestrian detection and localization, we adopt the ROS-compatible leg tracker in [54]. We use an ultra wideband (UWB) indoor positioning system to localize the companion and the navigation goal, which can then be easily mapped to the observationŝ p com and d g , φ g based on the odometry of the robot. Finally, a laptop is placed onboard as the processing unit and the policy is operated with a period of 0.1 second. The experiments are conducted in a narrow corridor with width of 1.56 meters as shown in Fig. 6, which is a typical scenario that requires pedestrians to navigate cooperatively.  In this subsection, we examine our method's performance in traditional social navigation scenario. Particularly, the robot is required to pass the corridor with two oncoming pedestrians and arrive at a goal that is 7 meters ahead. In addition, a control experiment of 3 humans (one as the compared human and the other two as pedestrians) is conducted in the same space. The metricD ped is computed. Example trajectories of the robot and the human control are shown in Fig. 7. In the robotic experiments, the trajectories of pedestrians are obtained from the robot's laser range finder while the robot's trajectory is based on its own odometry sensor. On the other hand, all trajectories in the human control experiments are captured using the UWB localization system. From Fig. 7, it is clear that the robot with our policy is able understand human's cooperative behavior for collision avoidance and navigate in an appropriate manner such that both itself and the other two pedestrians can successfully pass through the corridor. Specifically, when observing the two pedestrians (blue and purple) 4 meters ahead. The robot started to approach the wall on its left side so as to create free space on the right for the pedestrians to smoothly walk through. By comparing both figures in Fig. 7, we can see that the robot is as proactive as human since both black trajectories in Fig.  7(a) and Fig. 7(b) started to make space for the oncoming pedestrians at the early stage of cooperative avoidance process. As for the performance metrics, the average minimal distance to pedestrians for our robot isD ped = 0.35m. Although it is smaller than that of the human control experiments (D ped =0.56m), this value still indicates a safe and decent navigation behavior of our robot as its radius is only 0.17m.

B. Scenario 2: Socially Concomitant Navigation
In this subsection, the scenario of SCN is studied. A human companion initially standing in front of the robot will start to walk through the same corridor while another pedestrian is passing from the other end. As described in the previous sections, the robot with our policy should closely navigate with its companion and avoid the oncoming pedestrian cooperatively. An additional metricD com is used to evaluate the performance of our policy by comparing with the statistics obtained from another 10 human control experiments. Example trajectories are shown in Fig. 8 and the performance metricsD ped and D com are summarized in Tab. IV. As shown in Fig. 8 and Tab. IV, the robot is able to achieve both objectives of SCN. On one hand, it is effectively engaged into the joint collision avoidance process. The resulted behavior is similar to that observed in the last subsection and the robot even has a slightly largerD ped . On the other hand, the average maximum distanceD com is 1.05m, which is within the limit (2m) we specified in the learning process and nearly the same as that of the compared human, showing that the robot can actively navigate along with its companion instead of deviating to other areas or lagging itself behind. This shows that the robot driven by our policy is able to understand the pace of its companion and achieve a similar sense of companionship in terms of distance.
In sum, the above results demonstrate the practical efficacy of our methods for both the traditional social navigation and the more complicated SCN scenarios. It proves that the policy learned from our RPL simulative environment is transferable to uncovered real-world situations.

VII. CONCLUSIONS
In this paper, the problem of socially concomitant navigation (SCN) has been investigated and formulated under a POMDP (a) Trajectories of the robot and its companion (moving from left to right) and a pedestrian (moving from right to left).
(b) Human control experiment in a similar SCN. The black (compared human) and orange (companion) trajectories are from left to right and the blue (pedestrian) trajectory is from right to left.