Inverse Reinforcement Learning of Pedestrian–Robot Coordination

We apply inverse reinforcement learning (IRL) with a novel cost feature to the problem of robot navigation in human crowds. Consistent with prior empirical work on pedestrian behavior, the feature anticipates collisions between agents. We efficiently learn cost functions in continuous space from high-dimensional examples of public crowd motion data, assuming locally optimal examples. We evaluate the accuracy and predictive power of the learned models on test examples that we attempt to reproduce by optimizing the learned cost functions. We show that the predictions of our models outperform a recent related approach from the literature. The learned cost functions are incorporated into an optimal controller for a robotic wheelchair. We evaluate its performance in qualitative experiments where it autonomously travels between pedestrians, which it perceives through an on-board tracking system. The results show that our approach often generates appropriate motion plans that efficiently complement the pedestrians' motions.


I. INTRODUCTION
A UTONOMOUS navigation in pedestrian areas has been an active area of research due to applications in smart wheelchairs or delivery robots. In order to navigate smoothly and safely in human crowds, robots need to consider how their own actions affect surrounding humans and reason about the individual costs and benefits of each involved agent. We propose to use a collective cost function that describes how desirable a particular combination of individual trajectories is from the collective point of view of all agents involved. We apply Inverse Reinforcement Learning (IRL) to capture important characteristics of human navigation from empirical data.
We assume a single objective function, whose features depend on each agent, similar to recent approaches [1], [2], [3] to IRL of multi-agent navigation. Similarly, techniques for crowd modeling and simulation [4], [5] optimize a single cost function over all agents' actions. Such formulations are appealing, as Traditional IRL assumes that observations are globally optimal, causing their computational cost to scale exponentially with the number of state dimensions. However, a more recent approach [6] overcomes this curse of dimensionality (as shown in [6]) by assuming that examples are only locally optimal. We argue that this framework is particularly well suited to modeling human crowds, since they may evolve in multiple qualitatively different ways with similar probabilities. For example, a pedestrian may pass another pedestrian on the left or on the right side, while trajectories in between the two alternatives should be costly because they lead to a collision. Thus, the two alternatives can be described as local optima which are separated by trajectories with higher costs.
Despite the availability of promising algorithms, the design of appropriate features for the problem of multi-agent navigation IRL remains a challenging and critical step in achieving a cost function that generalizes in a meaningful way and proves useful in real-world situations. Common features to account for collisions include f = d −2 [2] and f = exp(−d 2 /σ 2 /2) [3], where d denotes the distance between agents' centers. As both features are only position-dependent, they do not reflect how human walkers avoid collisions, namely by anticipating and regulating the minimum predicted distance d mp [7] (cf. Fig. 1(a)).
Prior empirical work [4] derived an anticipatory interaction potential E, defined as where τ denotes the time to collision of two given agents, τ o = 3 s is a time horizon, and η = 1 s 2 may be set arbitrarily. Note that E drops discontinuously to zero when approaching agents change their velocities to avoid each other. We propose a novel featureẼ that smoothly approximates E (cf. Fig. 1(c)), as required by the framework for continuous IRL [6]. We train two models incorporatingẼ on real crowd data. Our quantitative evaluation demonstrates more accurate and smoother predictions of pedestrians' trajectories compared to a state-of-the-art alternative [2]. Further, we qualitatively evaluate our approach's performance at controlling a mobile robot navigating in a real crowd.

II. RELATED WORK
Several previous works have used IRL to learn navigation in crowds. Some works [8], [9], [10], [11], [12] use discrete state and action spaces. These approaches operate on grids over This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ the entire state space, and thus, they cannot easily incorporate multiple agents due to exponentially increasing computational cost. Instead, they typically consider pedestrians as exogenous inputs which affect the cost/reward features only. While such an approach may yield sensible behavior, it cannot produce a model which reasons about interactions between agents, since the behavior of other agents is treated as a given input to the model.
In contrast, the works [1], [2], [6], [13] have adapted the popular Maximum Entropy (MaxEnt) IRL [14] method to continuous state and action spaces. In the works on continuous spaces, trajectories are either parametrized by the control actions at each discrete time point [3] or as splines [1], [2], [13]. For dealing with the high dimensionality of the space of possible multi-agent trajectories, [3] resort to a local approximation [6] of the exponential policy for MaxEnt IRL, similar to our work. In contrast, [1], [2], [13] simplify the policy by discretizing it into topological variants (who passes on which side of who) and/or use computationally expensive Monte Carlo techniques.

A. System Model
For a system comprising n agents, let p i , v i , a i ∈ R 2 denote the position, velocity, and acceleration of the i-th agent, respectively, where i ∈ {1, 2, . . . n}. Let the system's state contain all positions and velocities. The system's action is defined by all accelerations, on the other hand.
The system's transition from one state x (k) to another state x (k+1) , under an action u (k+1) over a time step of duration h, is described by the linear dynamic system where I ∈ R 2n×2n denotes the identity matrix. The above dynamic model (4) corresponds to applying constant accelerations over a duration h.
A state-action trajectory S := (X , U , T ) is defined by a state sequence X := {x (k) } K k=0 and an action sequence U := {u (k) } K k=1 satisfying the dynamic model (4) for 0 ≤ k < K with a uniform time step h, and by a time sequence T := {t (k) } K k=0 , where each t (k) denotes the time at which the state x (k) is attained, and t (k+1) − t (k) = h, ∀k.

B. Framework for Inverse Reinforcement Learning
We assume that there is a scalar cost function J(X , U ) which reflects the collective objectives of the multi-agent system under consideration, e.g. navigating efficiently and without any collision. For a given cost function J and an initial state x o , we model the system's behavior over K + 1 uniform time steps as a local optimum of the optimal control problem Given a set E := {S l } L l=1 of L examples S l (multi-agent trajectories) recorded from a human crowd, we aim at inferring a meaningful cost function J, such that the examples S l are approximated by local optima of (5) with J.
We aim at linear estimation of J in the form where the feature vector f (x, u) ∈ R q is specified a priori to describe quantities of interest, and where w ∈ R q is the unknown vector of the features' weights. We use a linear technique as we hypothesize that our features describe non-interacting cost components. In the original maximum entropy (MaxEnt) framework [14], the weights are learned by maximizing the examples' likelihood under a policy which assigns a probability density to an action sequence in proportion to the exponential of its reward (i.e. negative cost). We adopt the framework [6], where this policy is approximated as a Gaussian in U around the examples, thereby relaxing the original MaxEnt framework's assumption of their global optimality to only local optimality.

C. Cost Features for Cooperative Navigation
For simplicity, we use only four features, which we consider as the most essential ones for multi-agent navigation, describing individual effort and goal-directed motion, and pairwise distance and anticipatory collision avoidance, respectively. Individual features are of the form f : 1) Individual Feature Contributions: An agent's effort can be described by the feature contribution where a i denotes the agent's acceleration. Alternatively, let define a feature contribution smoothly approximating |a i |, where λ > 0 controls the function's sharpness at the origin. An agent's deviation from its desired motion is described by the feature contribution where v i andv i denote the agent's actual and desired velocity, respectively.
2) Pairwise Feature Contributions: For a pair of agents with respective indices i and j, let p ij := p i − p j and v ij := v i − v j denote their relative position and velocity, respectively. We employ a simple pairwise feature contribution commonly found in related work, namely a gaussian function of both agents' distance [3] f (c) where σ > 0 denotes the activation radius. Additionally, we define the pairwise feature contribution with ψ := ∠(−p, v), and ψ c denoting half the opening angle of the cone V O ∞ (cf. Fig. 1(a)), omitting the subscripts ij for brevity. Thus, z = 0 holds when the relative velocity is at the edge of V O ∞ , which becomes the activation threshold.
Letτ :=D/|v|, whereD approximates the distance D that the relative position p can travel along the relative velocity v before entering the disk of radius R centered at the origin (cf. Fig. 1(a)). We defineD 2 as where ε 2 > 0 prevents division by zero, and d mp ≥0 denotes the minimum predicted distance (cf. Fig. 1(a)). Disregarding ε 1 , it holdsD 2 = D 2 , for ψ = 0 or ψ = ψ c . We set ε 1 := 0.22R 2 to ensure thatD 2 > 0 always holds. To see this for the only non-trivial case |p| < R, we replace d 2 mp in (12) by its upper bound |p| 2 and analyze the resulting cubic function of |p|, to find that its minimum on 0 < |p| < ∞ is positive. The Fig. 1(c) shows the resulting approximation for the same configuration as for Fig. 1(b), where we set s = 10 and ε 2 = 0.01 m 2 /s 2 .
3) Feature Vectors and Normalization: We define as two alternative feature vectors, which differ only in the first feature f (a2) or f (a1) , which sum accelerations' squares or smooth magnitudes, respectively. We normalize features and weights to reveal their relative importance. Letf (E) := P E 80 {f } define the normalizer for any feature f as its 80-th percentile over a sample set E. Then, for any feature f and the respective weight w, we define the corresponding normalized feature as φ := f/f and the corresponding normalized weight as θ := wf , such that the corresponding term of the cost function can be written equally as wf = θφ.

IV. LEARNING EXPERIMENTS
We apply our IRL approach to real-world multi-agent trajectory data from the public datasets DIAMOR [15] and ETH [16]. Using f (a 2 ) and f (a 1 ) as feature vectors, we learn two respective cost functions J (a 2 ) and J (a 1 ) , which model multiple agents' navigation choices, and we evaluate their accuracy on both datasets. The datasets DIAMOR and ETH have been recorded in a shopping mall and on a university campus, respectively, and include wheelchair users (DIAMOR) and pedestrians (both). Our code is available at https://github.com/epfl-lasa/navioc.

A. Data Pre-Processing
We obtain multi-agent examples of individual trajectory segments and associated desired velocities as follows.
1) Fitting Individual Trajectories: The (noisy) trajectory data of an agent, as given originally in a dataset, is denoted as , consisting of M sequential time valuest (i) and corresponding positionsp (i) . A state-action trajectory S, describing the single agent at hand (n = 1) according to Section III-A, is fit to the data by solving the following quadratic program. We jointly optimize X , U and a sequence of interpo- where α > 0 is a regularizer controlling smoothness, Trajectories longer than 100 h are fit in an incremental fashion, in order to keep the dimensionality of the optimization problem (16) at an efficiently tractable level. Namely, we sequentially fit multiple temporally overlapping state-action trajectories of 100 time steps each to sub-sequences of the original trajectory's data, where additional constraints enforce continuity at stitching points. The Fig. 2 shows original tracks from DIAMOR, and the corresponding fit trajectories for h = 0.05 s and α = 0.01s 4 .
2) Estimating Desired Velocities: We assume each agent to have a constant desired speedv, and we estimatev as the mode of the agent's speed histogram, disregarding speeds below a threshold v min = 0.3 m/s. The desired orientation of any agent is assumed to be timevarying and denoted here asφ (k) . For DIAMOR, any agent's ϕ (k) is estimated to be left or right dependent on the actual velocity v (k) x being negative or positive, respectively. For ETH, we first identify any agent's goal among the dataset's four designated goals as the one which the agent is facing mostly during the later half of its trajectory. Secondly,φ (k) is defined as the orientation of the vector from the agent to the goal at each instant k. Finally, for each agent, the desired velocity is defined asv (k) :=v[cosφ (k) sinφ (k) ] T , if |v (k) | > v min , or as zero, else. Thus, for each agent, a sequenceV := {v (k) } K k=0 of desired velocities is obtained.
3) Sampling Multi-Agent Trajectories: The time domain covered by individual trajectory fits (per dataset) is divided into adjacent intervals of uniform duration T = 4.8 s, by convention [2]. For each such interval, the parts of those trajectories which are defined on the entire interval are aggregated into a multi-agent state-action trajectory. The agents' desired velocities are taken as constant over each interval and equal to their values at the interval's initial time. For DIAMOR, trajectory   Fig. 2) are disregarded, whereas for ETH, all agents are considered.

4) Defining Training and Test Sets:
We inspect each multiagent sample obtained from DIAMOR and discard those which contain agents whose desired motion does not seem aligned with the x-direction, e.g. because they traverse the corridor along the y-direction or diagonally due to intrinsic motives and not to perform avoidance maneuvers. To group the remaining samples from DIAMOR in a training set E 1 and a test set E 2 , they are sorted in their temporal order and then assigned in an alternating fashion to E 1 or E 2 , to ensure that the two sets reflect similar crowd conditions. In contrast, all multi-agent samples obtained from ETH are grouped in the set E 3 , without performing any manual selection. The Fig. 3 shows the histogram of the number of agents n per sample for each of the three datasets. Their respective sizes are |E 1 | = 22, |E 2 | = 21, and |E 3 | = 99.

B. Training
We have adapted the publicly available software package 1 by [6] to learn from examples of varying dimensionality (due to varying numbers of agents) and complemented the package with our system and feature definitions. Training on the dataset E 1 takes around 4 and 7 minutes for the feature vectors f (a 2 ) and f (a 1 ) , respectively.
For any feature f , the Table I reports the normalizerf (E 1 ∪ E 2 ), which has been computed with respect to all the selected samples from DIAMOR, and the learned normalized weights θ (a 2 ) and θ (a 1 ) for the feature vectors f (a 2 ) and f (a 1 ) , respectively. Weight vectors have been scaled such that the acceleration feature's normalized weight equals one.

1) Choice of Hyper Parameters:
The weights θ are initialized as random noise with zero mean and standard deviation 0.01, except for the acceleration weight, which needs to be negative for convergence and thus is set to −1. We observe that for too small α, only a negligible weight is learned for the interaction feature f (e) . The sharpness s for the feature f (a 1 ) strongly affects the learned behavior, as for both very small or large s, acceleration is penalized more strongly relative to the velocity error, such that agents slowly adopt their target velocity.

C. Evaluation
To evaluate the learned models on a given example, we optimize a multi-agent trajectory according to (5) under the learned cost functions. For examples from E 1 and E 2 , each local optimization is initialized by the respective example's complete actual trajectory, and x o is specified as the example's initial state obtained by non-causal fitting according to Section IV-A1. In contrast, for E 3 , each local optimization is initialized with U ≡ 0, and x o is specified as the causal estimate of each agent's state at time zero which is provided in the ETH dataset. We denote the euclidean distance between the optimized positions and ground truth as the modeling error e model , for E 1 and E 2 , and as the prediction error e predict , for E 3 . As a baseline, we also evaluate on all example sets a model assuming constant velocity (CV) after starting from a given initial state x o .
Examples of actual and reproduced trajectories are shown in Fig. 4. For both our models and CV, Fig. 5 plots the modeling error e model on the DIAMOR training and test sets E 1 and E 2 , respectively, as a function of time Δt after the initial time of the trajectories considered. The Fig. 6 compares the prediction error e predict on the test set E 3 from ETH for our models, CV, and the corresponding results reported in [2] for their IRL approach.
The Table II reports the number of collisions for all example sets, comparing ground truth with the learned models and CV. Here, any isolated period in which the distance between two agents is below their combined radius R = 0.4 m counts as a single collision. The Table II shows that our models generate very few (J (a 2 ) ) or zero (J (a 1 ) ) collisions.

D. Discussion
It can be seen from Fig. 5 that both learned models provide a more accurate description of pedestrians' trajectories than the baseline model CV assuming constant velocity, and that good generalization from the training to the test set is achieved for Fig. 6. Prediction error is calculated as the distance between agents' observed and predicted positions as a function of time Δt from the instant at which the prediction is issued. The plot shows the prediction error's sample mean and the interval between its lower and upper quartiles (shaded area) for our two models J (a 2 ) and J (a 1 ) , as well as the sample mean reported in [2] for their IRL approach and the sample mean for a baseline assuming constant velocity, all for the test set E 3 sampled from the dataset ETH [16].    2) The pedestrian (red) has avoided the robot on the more efficient side, and the robot attempts to avoid a standing pedestrian (blue). (b. 3) The robot attempts to pass in between the two standing pedestrians (blue and green/purple) and is about to be stopped by the supervisor. (c.1-2) At first, the robot plans to avoid a pedestrian (green), but the pedestrian changes course faster, such that the robot's plan changes back to a straight path. (d.1-2) The robot and a pedestrian (the supervisor) reciprocally avoid each other. (e) A pedestrian (red) overtakes the standing robot and then changes direction away from the robot as it starts moving; subsequently, the robot shows slight adaptation to approaching pedestrians (green, purple). both models J (a 2 ) and J (a 1 ) . In terms of mean modeling error, J (a 2 ) outperforms J (a 1 ) especially at low Δt. Similarly, Fig. 6 shows that our models generate more accurate predictions than the baseline. For Δt < 4 s, our models' predictions are also more accurate than those by [2]. Comparing J (a 2 ) and J (a 1 ) , it is again visible that for low Δt, predictions by J (a 2 ) are more accurate on average than those by J (a 1 ) , whereas for high Δt, the opposite can be said. For Δt > 4 s, it seems that the approach by [2] would yield the most accurate predictions, considering the trend in the error curve (which remains speculation, however, due to missing data in this regime).
Regarding Fig. 6, it is noteworthy that the error curves' sense of curvature differs across models, namely, the curves for J (a 2 ) and CV bend upwards, whereas the curve by [2] bends downwards, and the curve for J (a 1 ) is rather straight. This observation is related to how the different approaches take into account the agents' desired velocities or goals on the one hand, and their initial velocities on the other hand. In the framework of [2], trajectories' endpoints are specified a priori. Thus, when accurate goal estimates are available, predicted and actual trajectories will tend to converge again for later points in time with their method. In contrast, our models consider desired velocities and trade off following them versus keeping accelerations small. Since J (a 2 ) gives higher weight to accelerations than J (a 1 ) (cf. Table I), it leads to slower adoption of desired velocities in favor of maintaining initial velocities (similarly as CV). This difference is also visible in the reproduced trajectories, comparing e.g. the ones shown in Fig. 4(2.b), (2.c).

V. ROBOT EXPERIMENTS
To investigate our IRL method's capability to guide a robot through human crowds, we implement an optimal control system on the smart wheelchair Qolo [17] and deploy the robot on EPFL campus, in a (usually) busy corridor (cf. Fig. 8). Our experiments were conducted under approval by the Human Research Ethical Committee of EPFL under review no. HREC-032-2019, renewed in January 2022.
1) Implementation: We use the package MinFunc 2 to optimize a cost over the actions of the robot and of tracked pedestrians on-line (cf. Fig. 7). The optimal controller's objective is defined by the learned model J (a 1 ) , which allows to command the robot via its desired velocity, as the respective feature's weight θ (v) (a 1 ) is sufficiently high. When formulating the optimal control problem (5) to be solved on-line, we increase the time step to h = 0.4 s and use only K = 12 steps, in order to obtain the same planning time horizon of 4.8 s as in the learning phase at lower computational cost. To account for the robot's larger diameter, we set R = 0.6 m for the interaction energy feature. The robot's desired velocity is set equal tov = [1 m/s, 0], in a reference frame whose x-axis aligns with the corridor.
We implement an on-board tracking system 3 for pedestrians, using the Robot Operating System (ROS) and a LiDAR sensor mounted on the robot's front to detect legs. Firstly, planar laserscans are segmented by breakpoints [18]. Then, circles are fit to segments, and those with radius below 7 cm are retained as detections. Each detection is associated with the track with closest predicted position, if its distance is below 0.75 m, or initiates a new track otherwise. Tracks' states are estimated by α-β-filtering (α = 0.15, β = 0.02).
To maintain the problem size tractable on-line, the state x o describes the robot and, at most, 3 pedestrians, namely those maximizing a score ζ := −|y| − 0.25 max(x, 0) + 3 min(x, 0), where x and y denote a given pedestrian's position relative to the robot in its forward and lateral direction, respectively.
2) Static Obstacle: We perform a preliminary experiment, involving only the robot and a static obstacle of cylindrical shape to be detected as a leg by the robot's tracking system. In each trial, the robot is positioned around 12 m away from the obstacle and oriented such that it is facing the obstacle as precisely as possible. We perform two trials, using our method J (a 1 ) and a Velocity Obstacle approach based on [19], respectively.
3) Human Crowds: For experiments in human crowds, the robot uses our method to drive autonomously over a distance of around 15 m between uninformed pedestrians and a supervisor who is ready to remotely stop the robot at any moment. The supervisor usually walks behind the robot or by its side at a distance of around 2-5 m, except in one trial, where the supervisor deliberately interacts with the robot, shown in Fig. 10(d).
In total, 5 trials are performed.

A. Results
The Fig. 9 shows the robot's motion in the preliminary experiments with the static obstacle. It successfully avoids and passes the obstacle on the left side with either method.
For the crowd experiments, Fig. 10 plots estimated trajectory segments for the five trials. In one case (cf. Fig. 10(b.3)), a distracted pedestrian stands in the robot's way, such that the robot needs to avoid a collision by itself. Here, the robot exhibits the correct tendency but does not achieve sufficient clearance, and thus, the supervisor needs to stop it for safety reasons. In the remaining 4 out of 5 trials, the robot travels the full distance without interruption. The robot's speed's average and standard deviation over all trials are 0.82 ± 0.16 m/s. For the distance traveled along the x-axis and the duration of a trial, the mean and standard deviation are 15.48 ± 4.83 m and 18.80 ± 5.17 s, respectively.
We generally observe that, in order to prevent imminent collisions, pedestrians adapt more to the robot than vice-versa. Nonetheless, the robot's contribution to collision avoidance is mostly constructive, i.e. it changes direction in such a way that the minimum predicted distance increases. This is exemplified by the situations depicted in Fig. 10(a.1), (b.3), (c.2), (d.2), (e). We also observe one case where the robot's response to an imminent collision does not match that of the involved pedestrian, as the robot plans to avoid the pedestrian on the right, whereas the pedestrian plans to avoid the robot on the left, which is depicted in Fig. 10(b.1), (b.2). There, at a time when the pedestrian has already changed direction such that their paths would not cross, the robot rotates, attempting to cross the pedestrian's future path. In two cases (cf. Fig. 10(a.2),-(e), a pedestrian actively increases clearance to the robot, which is driving behind the pedestrian.

B. Discussion
The robot's contribution to collision avoidance is often constructive, as shown by the examples in Fig. 10, but relatively small, in comparison to pedestrians' contributions. On the one hand, the robot's ability to execute planned trajectories is limited by the robot's acceleration bounds and the absence of positionfeedback for tracking the trajectory (since our system's position estimates were found to be too unreliable for feedback control). On the other hand, in the example depicted in Fig. 10(b.1), it is visible that planned trajectories may lack consistency over time, i.e. they may fluctuate between alternative plans, such as avoiding someone on the left or on the right. It is clear that the robot will not execute either of the alternative plans properly as long as it keeps switching between them. A remedy could be found in restricting solutions to some admissible set, or in performing some global analysis to guarantee that a local minimum is chosen in a unique fashion.
In the situation depicted in Fig. 10(b.3), the robot plans its path between two pedestrians, but they do not stand far enough apart and one of them does not see the robot. The robot's failure to avoid a collision can be explained by (i.) a non-zero velocity estimate for the actually standing pedestrian due to the robot's inaccurate estimation of its own angular velocity, (ii.) poor approximation of E byẼ at low distances due to ε 1 , (iii.) the aforementioned limited trajectory tracking performance, and (iv.) the wrong assumption that the pedestrian will cooperate. Our approach generates motion plans in which all agents contribute to minimizing a collective cost by adapting to some extent to each other. Thus, such plans cannot account properly for non-interacting agents. Still, by updating its plan at a high frequency, the robot may successfully avoid completely passive agents or static objects (cf. Fig. 9).
Another limitation stems from assuming that the desired velocities of agents equal their current velocities, which could be overcome by a more elaborate probabilistic estimation technique. Arguably, this would help to predict the crossing order in situations as in Fig. 10(b.1), (b.2).

VI. CONCLUSION
We have applied IRL in a continuous framework [6] to multi-agent navigation. We employ a novel cost feature which smoothly approximates an interaction energy proposed by an empirical study of crowd motion [4]. Two alternative feature vectors are constructed, quantifying acceleration either by its squared or smoothed magnitude, respectively, and their weights are trained on examples from the public dataset DI-AMOR [15]. While both cost functions avoid collisions, the one based on squared accelerations favors smooth motions, whereas the alternative one lets agents adopt their target velocities quickly. We evaluate the models quantitatively on DI-AMOR and the public dataset ETH [16]. In comparison with a prior approach to linear IRL of multi-agent navigation [2], our models yield better predictions for ETH on average, particularly in the short term, because they penalize for sharp motions.
Experiments with the robot Qolo have demonstrated that our approach can plan reasonable trajectories on-line in real-world interactions with pedestrians. However, to avoid collisions reliably, our approach requires further developments in order to 1. stick to a decision on which side to avoid a given pedestrian, 2. penalize more strongly for contact at low relative velocities, and 3. take into account the robot's non-holonomic kinematics and dynamic constraints.

VI. ACKNOWLEDGMENT
This letter was recommended for publication by Editor Gentiane Venture upon evaluation of the Associate Editor and Reviewers' comments.