Autonomous Navigation System in Pedestrian Scenarios Using a Dreamer-Based Motion Planner

Navigation among pedestrians is a crucial capability of service robots; however, it is a challenge to manage time-varying environments stably. Recent deep reinforcement learning (DRL)-based approaches to crowd navigation have yielded numerous promising applications. However, they rely heavily on initial imitation learning and colossal positive datasets. Moreover, the difficulties in accurately localizing robots, detecting and tracking humans, representing and generalizing reciprocal human relationships restrict their deployment in real-world problems. We propose a Dreamer-based motion planner for collision-free navigation in diverse pedestrian scenarios. Our RL framework can completely learn from zero experience via a model-based DRL. The robot and humans are first projected onto a map, which is subsequently decoded into low-dimensional latent state. A predictive dynamic model in the latent space is jointly created to efficiently optimize the navigation policy. Additionally, we leverage the techniques of system identification, domain randomization, clustering and LiDAR SLAM for practical deployment. Simulation ablations and real implementations demonstrate that our motion planner outperforms state-of-the-art methods, and that the navigation system can be physically implemented in the real world.


I. INTRODUCTION
A UTONOMOUS driving systems are becoming prevalent in human society because of their promising prospects of high efficiency, safety, and intelligence. Additionally, an aging society, labor shortages, and noncontact services during the pandemic promoted the research and development of autonomous mobile robots in hospitals, restaurants, hotels, etc. [1]. However, socially aware robot navigation is a highly complex task because it involves mapping and localization, human detection and behavior analysis, social rules, and decision and planning [2].
On one hand, the modules of perception and motion planning are separately studied by autonomous driving companies and research institutes [3]; thus, human-vehicle interaction is not considered. Moreover, the surrounding humans are individually Manuscript  detected and tracked, and their motions are independently analyzed; therefore, the reciprocal relationships among pedestrians are excluded. On the other hand, state-of-the-art algorithms proposed by academia assume that states such as human number, position, and speed are fully known and simply focus on motion planning, which limits their generalizability and hinders sim-to-real transfer [4], [5], [6], [7]. In the earlier stages, rule-based motion planners played a dominant role in crowd navigation. Two pioneering approaches include optimal reciprocal collision avoidance (ORCA) [8] and the social force model (SFM) [9]. However, their one-step planning framework resulted in short-sighted, unsafe, and unnatural behaviors [10]. To plan motion over a long horizon, an intuitive strategy is to first predict human trajectories based on a model and subsequently select an optimal path for robots [11], [12]. Nevertheless, human motion models generally focus on individuals, ignoring the social relationships among pedestrians.
Recently, crowd navigation has switched to learning-based methods because of their prominent capability of representing the latent features of human-human and human-robot interactions and planning optimal navigation paths at a long time-scale. The collision avoidance with deep reinforcement learning (CADRL) algorithm [13] and its extension, socially aware CADRL (SA-CADRL), are the main algorithms used in the field of learning-based crowd navigation. These approaches feed all human states, such as positions and speeds, into deep neural networks (DNNs) to extract the implicit reciprocal motion features of humans, which are further fed into deep reinforcement learning (DRL)-based policy neural networks to learn an optimal motion planner. However, these two methods are not generalized to human numbers; thus, DNNs need to be redefined and retrained when the human number changes. Therefore, a subsequent algorithm, long short-term memory reinforcement learning (LSTM − RL), leverages LSTM to represent the relational motion features of humans to allow arbitrary human numbers [14], [15]. Because the inputs of LSTM neural networks are sorted by the distance between the robot and each human in descending order, the relationship among humans is not reciprocal.
To comprehensively describe the reciprocal relationship among pedestrians, the attention mechanism and graph convolutional network (GCN) are broadly embedded in DRL-based crowd navigation [5], [6], [10], [16]. However, the aforementioned navigation algorithms assume fully known human states, including human number, position, and speed, and rely heavily on imitation learning and colossal positive datasets collected by rule-based methods such as ORCA, which may result in an early sub-optimum. Moreover, precisely tracking humans and estimating their speeds is difficult in the real world owing to various uncertainties; thus, sim-to-real transfer is a crucial challenge for these methods.
To let go of the assumption of fully known human information in simulations, directly using raw sensor data is a promising alternative [17], [18], [19], [20], [21]. The reciprocal relationship among pedestrians is extracted from consecutive raw sensor data such as high-dimensional LiDAR scans with DNNs. However, direct handling of raw sensor data is inefficient. Consequently, imitation learning and colossal positive datasets are required to initialize DNNs. In addition, the lack of open-source solutions limits further development and comprehensive ablation.
Our study addresses these shortcomings in a comprehensive manner. First, we only detect humans and obtain their positions while excluding human tracking and speed estimation. After localizing the robot and humans, we create an RGB map that can be maturely processed using autoencoder algorithms to represent the instant relationship between humans and the robot. Accordingly, our algorithm can free the assumption of fully known environments and improve generalizability with respect to the human number and speed. In addition, inspired by the Dreamer approach, a model-based RL to address long-horizon tasks from images purely by latent imagination [22], we create a dynamic model with recurrent neural networks (RNNs) to accumulate history information and predict future states over a long horizon, thus reducing the probability of local optima. Moreover, the dynamic model can facilitate policy learning via the model-based DRL framework; thus, we can completely learn an optimal navigation policy without any imitation learning or a massive dataset. The contributions of this study are summarized as follows.
r A complete and publicly accessible autonomous navigation system among pedestrians is developed. We precisely obtain the robot pose using a LiDAR SLAM algorithm, and extract humans via a clustering approach. Moreover, we plan robot motion using a model-based DRL framework to avoid pedestrians and reach a target with a high success rate and navigation efficiency.
r We propose a Dreamer-based motion planning algorithm that can efficiently obtain an optimal motion planner and be generalized to arbitrary human number, variable human speed, and complex human relationships.
r We reproduce several state-of-the-art algorithms for more comprehensive ablation and ensure they are open-sourced. Additionally, sufficient sim-to-real experiments are implemented using domain randomization and system identification techniques. The code of the whole project is publicly available at https:// github.com/zw199502/navigation_among_pedestrians and the video is shown at https://youtu.be/KM2WPpQBfrI.

A. Navigation System
A complete navigation system integrates perception, decision, planning, and control modules, which are broadly researched and developed for autonomous driving vehicles [3], [23]. However, these modules are studied separately in both industry and academia. Although there are several mature and open-source solutions, such as Apollo 1 and Autoware, 2 publicly accessible navigation systems among pedestrians are rare in the community of service robots. In this study, we construct a complete navigation system, in which the perception and control modules are derived from mature methods and decision and planning are achieved by a novel Dreamer-based motion planner.

B. Motion Planning Among Pedestrians
In the early stage, rule-based motion planners constituted mainstream crowd navigation, such as ORCA [8], SFM [9], and trajectory-prediction-based path optimization [11], [12]. With the rapid development of deep learning, researchers and engineers are focusing on learning-based methods, wherein DRL-based navigation algorithms are attractive because of their promising representation and optimization capabilities [5], [6], [10], [13], [14], [15], [16], [18], [19], [20], [24], [25], [26]. CADRL [13] is a pioneering study in the use of DRL for social navigation. However, its value function neglects the social relationships among pedestrians, as it only considers the robot's full state and one pedestrian's observable state. LSTM − RL [14] improves upon CADRL by leveraging LSTM to represent pairs of the robot's state and all pedestrians' states, but its ability to capture reciprocal relationships is limited since pairs are ordered by distance and then fed into LSTM networks. Socially aware RL (SARL) [10] and relational graph learning (RGL) [16] represent state-of-the-art extensions to CADRL and LSTM − RL by using self-attention mechanisms and graph convolutional networks, respectively, to capture interactions and reason about relations between agents. However, these methods assume fully known pedestrian information and require massive positive datasets for learning, leading to degraded performance in real-time settings. EGO [18] and LSTM − EGO [21] offer a more direct mapping of raw sensor data to navigation actions, but open-source solutions for replicating their results are scarce. We propose an opensource Dreamer-based motion planner that can learn completely from zero experience and release the ideal assumptions of fully known environments.

C. Dreamer
The Dreamer algorithm is a reinforcement learning agent that addresses long-horizon tasks from images purely by latent imagination [22], which yields a large number of achievements in simulated environments, such as Atari games and MuJoCo robots [22], [27], [28], [29]. Conversely, we focus more on real implementations of collision-free and socially aware robot navigation by leveraging the key idea of the Dreamer algorithm. A map is created to represent complex scenarios with variable human numbers and random initial states. A dynamic model with a map as a unique observation is learned to represent social relationships among humans. The learned model can facilitate learning complex behaviors, thereby enabling the robot to learn an optimal navigation policy without any prior experience.

III. APPROACH
We first illustrate the formulation of the problem of navigation among pedestrians with model-based RL and subsequently describe the details of the model creation and navigation policy learning using the Dreamer algorithm. In addition, we introduce a completely autonomous navigation system with perception, planning, and control modules. Fig. 1 shows the overview of the Dreamer-based motion planner.

A. Problem Formulation
We formulate the problem of crowd navigation as a partially observable Markov decision process (POMDP) defined by a tuple (S, H, A, P, R, O, γ), where S represents the stochastic state space, H is the deterministic history feature space, A denotes the action space, P is the state transition model, R represents the instant reward space, O is the observation space, and γ is a discount factor.
The unique observations are the sequenced RGB images (o t , o t+1 , . . . , o t+K ), each of which illustrates the instantaneous positions of humans and the robot. We decode the highdimensional image o t+τ into the stochastic latent state s t+τ using considerably fewer variables. In addition, historical state information is accumulated as deterministic feature h t+τ . Given action a t+τ , stochastic state s t+τ , and hidden information h t+τ , the next state s t+τ +1 and accumulated feature h t+τ +1 can be derived from the state transition model P. Given policy π, we can represent the expectation of the value function starting from state s t+τ as follows: where r t+τ +i is the instant reward in state s t+τ +i . The goal is to determine policy π * to maximize the value function.
B. World Model 1) Observation: Instead of directly using fully known pedestrian states, including the human number, position, and speed [5], [6], [10], [13], [16], or clumsily dealing with raw sensor data [18], [19], we create a map using only the position information of humans and the robot. Therefore, we can reduce the uncertainties of human tracking, speed estimation, and trajectory prediction; generalize our method; and maturely process highdimensional image observations via representation learning. The image has three RGB channels, with the R channel representing humans and the B channel depicting the robot. We assume that the human is a circle with radius r h , and the human circle is projected as deep red pixels with the value p r = 255, as shown in Fig. 2. The pixel value p r is zero when the pixel is not occupied by a human. Similarly, the robot is assumed to be a circle with radius r r , which is represented by deep blue pixels with a value p b = 255. By contrast, we inflate the robot circle to define an uncomfortable zone between the robot and humans. The blue pixel value p b decreases when the pixel is far from the robot rim, and becomes zero when the pixel is outside the uncomfortable area. Note that the G channel is not used to represent the goal position in our current work because we fix the robot goal and randomize the robot's initial position, which is reasonable because the positional relation between the robot and the goal is relative.
2) Reward and Discount Factor: The reward function is defined according to the image observation. We first add the R and B channels together to yield a new 2D matrix m a . If any value of m a is equal to 510, a collision occurs, the reward is the minimum r c = −0.6, and the discount factor γ is zero. In addition, the reward becomes r o = −0.1 when the robot rim is outside the motion area, and γ remains zero. Another case of zero γ is when the goal distance d g from the robot is less than the threshold d r , that is, when the robot reaches the target. The reward is the maximum r g = 1.0. If the robot does not collide with humans, reach the target, or stay outside the motion area, γ isγ = 0.99, and the reward is a combination of the goal distance and the uncomfortable index. The maximum in m a is m max a and the uncomfortable index is defined as d u = (m max a − 255)/255. In summary, the reward function and discount factor are defined as follows: (2) 3) Action: We utilize a quadruped robot, which is an omnidirectional mobile robot. Accordingly, we define the action as two orthogonal velocities v x and v y . Additionally, v x and v y are continuous instead of discrete, as defined in state-of-the-art approaches [5], [6], [10], [13], [16]; therefore, we can generate smoother motions in the physical world.

4) Autoencoder:
We leverage autoencoder technology [30] to reduce the dimension of the image observation and extract the motion features of humans and the robot from sequenced observations. As illustrated in Fig. 1, we first encode the image observation o t with convolutional neural networks (CNNs) to extract the feature q t which is an intermediate vector. Next, we project the combination of q t and deterministic history information h t onto the posterior latent stateŝ t with a multilayer preceptor (MLP). Subsequently, we concatenate h t andŝ t and decode the concatenation to restore the observation imagê o t with transposed CNNs. The autoencoder is summarized as follows: where θ denotes the weights of the world model network. The loss function of autoencoder can be derived from the likelihood probability represented as below: Because motion planning is executed in the latent state space, while the reward r t and discount factor γ t are defined in the original observation space, we need another two networks composed by MLP to predictr t andγ t from the concatenated h t and s t :r Similar to the loss function (4), another two additional loss functions can be obtained as follows:

5) State Transition Model:
It is prohibitively challenging to construct a state transition model in the original observation space with high dimensions; therefore, we model the motions of humans and the robot in the latent state space. We utilize the categorical latent variables to represent the latent state with 32 classes multiplied by 32 variables (shown in Fig. 1), based on the fact that categorical distributions can naturally capture multi-modal uncertainty of stochastic state transition [31]. Given the posterior latent stateŝ t , action a t , and history motion information h t , the next latent states t+1 and the next hidden history information h t+1 can be predicted using a gated recurrent unit (GRU) neural network as shown in Fig. 1: where f θ which is composed by a GRU and ps θ constructed by a MLP correspond to the dynamics shown in Fig. 1. The prior distributions t+τ is required to be similar to the posterior distributionŝ t+τ derived from the autoencoder model; thus, the fourth loss function can be defined as the Kullback-Leibler (KL) divergence: where β is a constant factor weighing the KL divergence loss.

6) Overall Loss Function:
Given an episode obtained from the interaction between humans and the robot, we select a sequence starting from time step t and ending at t + K, where K is a constant. We fill zero to elongate the episode when its length is less than K + 1 because of early collision, outside motion, or target reaching. The overall loss function along the sequence is represented as follows: All world model networks, including (3a), (3b), (3c), (5a), (5b), (7a) and (7b), are jointly updated using this single overall loss function.

C. Motion Planner
The state transition model in compact latent space enables trajectory prediction in the long horizon without high-dimensional image observation, which results in a low memory footprint and speedy predictions of thousands of imagined trajectories in parallel [22]. As shown in Fig. 1, starting from the latent stateŝ t+τ and history information h t+τ , a considerable number of episodes with H horizon can be swiftly generated. Consequently, we can efficiently leverage imagined episodes to optimize the navigation policy.
For the imagination process, we create an actor network and a critic network using MLP to map the current latent state and historic motion information into the action and value function, respectively:ā where φ and ψ denote the weights of the actor and critic networks, respectively. Additionally, the rewardr t+τ and discount factorγ t+τ are predicted using (5a) and (5b), respectively. With Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
the imagined episode having a long horizon, we can evaluate the value function with a multi-step RL framework because it yields a better unbiased estimation than the one-step RL algorithm [32]: where λ is another discount factor that weighs the value function, and κ ranges from t + τ to t + τ + H. We have two objectives: to determine a policy to maximize the value function, and to minimize the error between the estimated value function and the predicted value from the critic network. Therefore, the weights of the critic and actor networks can be updated as follows: wherev mean n is the mean of the distributionv κ .

D. Algorithm Summary
In the simulation, we assume that the human motion is generated by ORCA [8]. Additionally, the robot is assumed to be invisible to humans; otherwise, it will be difficult to differentiate whether our motion planner is effective or whether humans avoid the robot. The simulation is reset when collision, outside motion, target reaching, or timeout occurs. When the episode length is greater than t max , it is terminated as a timeout. We use the step function to update the positions of humans and the robot and obtain the instant reward and discount factor. Simulation interaction samples are collected to update the world model. We subsequently utilize the world model to imagine episodes in the latent space, which are used to update the motion planner. Next, a motion planner is used to generate new simulation episodes. We alternately update the world model and motion planner until a stable navigation policy is acquired.

E. Complete Navigation System
As stated in existing studies, wheeled mobile robots can localize themselves with the wheel odometer [5], [6], [10], [13], [16], which however drifts heavily with the increase in motion time. External motion capture systems are commonly used to precisely obtain the robot position and orientation [15], [17]. Nevertheless, such systems are expensive and impractical in a real human society. A navigation system without any accurate and internal localization module is incomplete and cannot be applied to society. Additionally, these studies have not focused on publicly accessible real-world implementations, such as human detection, speed estimation, and trajectory prediction. Conversely, we leverage the LiDAR odometry and mapping (LOAM) SLAM algorithm [33] to accurately localize the robot. In addition, we extract humans using a clustering approach [34]. Because we can obtain the sequenced position information of humans and the robot, we can extract the latent motion feature using our algorithm without individually estimating the human speed and future trajectory. Because our algorithm directly outputs two orthogonal velocities v x and v y , we need to further match v x and v y to the speed command of the quadruped robot used for practical implementations. We found that the actual speed generated by the official controller was significantly different from that of the desired command. Therefore, we use the system identification technique to calibrate the desired speed command to be consistent with the actual speed [35]. We made our system open-source for easier deployment in the autonomous navigation community.

IV. SIMULATION AND REAL IMPLEMENTATIONS
We used three simulation scenarios: Simulation one is for a comprehensive comparison, Simulation two is designed to verify the generalizability of our method, and Simulation three is constructed for sim-to-real transfer. Subsequently, we directly deployed the policy learned in Simulation three into various real scenarios without any retraining or fine-tuning. . The i-th human moved back and forth from these two positions with a preference speed of 1 m/s. Additionally, reciprocal motion among humans is generated by ORCA [8] with the robot being invisible. The robot radius was r r = 0.3 m and the maximum values of v x and v y were both 1 m/s. For a fair comparison, we assumed that the robot moves from (−4.0, 0.0) m to (4.0, 0.0) m. We reproduced seven popular baselines: one is a non-learning method namely ORCA [8], whereas the other six are learning-based approaches, called CADRL [13], LSTM − RL [14], SARL [10], RGL [16], EGO [18], and LSTM − EGO [21] respectively.
The ORCA algorithm assumes that the agent's states, including shape, size, position, and speed, are fully known. Based on this information, it generates an optimal collision-free action in one step. However, this one-step planning approach may result in short-sighted, unsafe, and unnatural behaviors. In contrast, DRL-based crowd navigation algorithms, which utilize a value function that can represent accumulated return over a long horizon, have become increasingly popular due to their ability to address these issues. For example, CADRL was a pioneering study in this area, but its value function only considered the pair of the robot and one human, making it unable to represent relational interactions among humans. As a result, LSTM − RL was developed to pair the robot with all humans, but it still only captures partial interactions because it sorts the pairs by distance before feeding them into the LSTM networks. To more comprehensively represent social interactions among humans, SARL utilizes a self-attention mechanism to capture interactions within pedestrians, while RGL embeds a graph convolutional network to reason about relations between agents and compute interactions between them. However, these methods all require fully known human states. In contrast, EGO and LSTM − EGO can handle both static and dynamic obstacles of different shapes, sizes, and numbers, as they directly map raw sensor data to navigation actions. EGO uses continuous LiDAR scans, while LSTM − EGO uses one frame of LiDAR scan and embeds LSTM to deal with sequential scans. However, a downside of these methods is the difficulty of efficiently learning a feasible navigation policy. While the other four have open-source solutions, publicly accessible resources for EGO and LSTM − EGO are rare. Therefore, we specially created a simulator that could generate 2D LiDAR scans and constructed neural networks for policy learning. We named our method navigation among pedestrians with a Dreamer-based motion planner (NPD). The learning processes of the six methods are illustrated in Fig. 3, and the final evaluation is shown in Table I. Please note that Fig. 3 displays the success rate of 100 evaluation episodes. Each episode may result in success, collision, overtime, or outside motion. Our analysis revealed that when the success rate was at or above 90%, the number of collision cases decreased to less than 5, sometimes even to 1, whereas most of the cases were overtime. Notably, in overtime cases, the robot was very close to the target. We hypothesized that the image resolution may have contributed to this phenomenon. As the robot was approaching the target, we reclassified the overtime cases as success in the final 500 tests. Because baselines CADRL, LSTM − RL, SARL, and RGL initialized the neural networks with imitation learning and fill the experience pool with a large number of positive samples ahead of the training, they could swiftly learn a feasible navigation policy. Although EGO's initialization was same, its learning was notably unstable and it could only reach a success rate of 0.54. The initial data settings of LSTM − EGO were same as ours, however, its success rate was 0.52, significantly lower than our method's. In addition, LSTM − EGO required the longest navigation time, averaging 15.08 s. We found that the success rates of all the baselines which require prior initialization became zero at all times when we removed imitation learning and the massive positive dataset. Conversely, our method could completely learn from zero experience and yielded a stable convergence and high success rate, which indicates that our method does improve learning efficiency. As shown in Table I, our approach quantitatively either outperformed or behaved similar as the other six learning baselines with respect to both the success rate and average navigation time. We found that the failure of our method initially occurred when two of the humans closely surrounded the robot. Additionally, the collision occurred within five time steps of 1 s. We believe that this short sequence results in the most of the remaining 1% failures. Although ORCA requires a short navigation time, it produced the lowest success rate, whereas our method was able to adequately balance the navigation time and success rate. Because our final goal is to deploy the navigation policy on real robot platforms, real-time performance should be another evaluation factor. The time used to generate an action when given observations is referred to action time. ORCA's action time was only tens of microseconds because ORCA is a non-learning approach. However, CADRL, LSTM − RL, SARL, and RGL required tens of milliseconds to derive an action because they only have a value network and need to inquire each action choice to obtain the best one. On the contrary, EGO, LSTM − EGO, and NPD(ours) have both an actor network and a critic network, therefore, they are able to quickly access an optimal action from the actor network. Because the planning frequency in the real world is motions follow ORCA rules. We also deviated the human action originally generated using ORCA by adding uniform noises with the bound of 0.2 m/s. The success rate of our method could still reach 94%, which indicates the generalizability of the trained policy with respect to human motion modals.
2) Training for Generalizability Verification: We trained our model in a more challenging navigation task to further verify the generalizability of our approach. Similar to the scenario shown in Fig. 2, the motion area is 10 × 10 m. Differently, the obstacle number in the revised environment is up to 7, which makes the training scenario denser and more complicated. Moreover, the number of moving humans changes from 1 to 4. Additionally, we add rectangular static obstacles whose number is variable from 1 to 3 and side length ranges from 0.3 m to 0.4 m. Baselines CADRL, LSTM − RL, SARL, and RGL assume that the obstacles are circles and their number is fixed during the whole training, whereas baselines EGO and LSTM − EGO are independent of obstacle number and shape. We choose LSTM − EGO as the ablation simply because it is a more recent study. The training process is depicted in Fig. 4. Although the environment becomes complex, our method can still reach a 93% success rate, significantly outperforming LSTM − EGO whose success rate is below 40% at all times.
3) Training for Policy Transfer: To enable sim-to-real transfer, we leveraged the domain randomization technique to improve the generalizability of the simulated navigation policy. Because the maximum sideward speed of the real quadruped robot was 0.27 m/s, we first constrain the maximum v y as 0.27 m/s and the maximum v x 0.3 m/s. Note that we did not enlarge the forward speed v x for slow and stable motion in our small real scenarios with a size 3 × 3 m. Different from the simulation configurations considered for comparison, we narrowed the motion area to L = 6 m, and the human number changed from 1 to 3. Additionally, we randomized the initial robot position while keeping the goal fixed at (1.0, 0.0). Moreover, the initial human position was randomized over the entire motion area, while the human goal was distributed around the margin of the motion area. The preference speed of humans ranges from 0.15 m/s to 0.3 m/s. For RL-based baselines, obtaining a feasible navigation policy is challenging if the environment significantly changes. Moreover, certain baselines, such as CADRL and SARL, do not allow variable human numbers during training. Although we introduced a large number of stochastics, our algorithm could produce stable policy optimization while LSTM − EGO failed Fig. 5. Learning process with stochastic configurations in simulation. Different from the training for comparison, this training is executed in stochastic environments with variable human numbers and distributions to learn a more generalized navigation policy. NPD considers the robot's collision margin as a circular shape, whereas NPD-CM uses a more accurate collision margin that is a circumscribed rectangle around the physical robot. RGL's learning is omitted because its success rate is zero at all times.
in obtaining a feasible navigation policy, as shown in Fig. 5. The trained policy achieved a 95% success rate in the final evaluation with 500 random settings.
In addition to comparing with the learning-based baselines LSTM − EGO and RGL, we conducted ORCA as another ablation. We found that ORCA could significantly improve navigation success rate from 0.47 to 0.98 when human number was decreased from 5 to 3. Although ORCA (0.98) outperformed our method (0.95) in simulation, we found that the robot was inclined to move outside of the specific area in real scenarios. The possible reason may be inaccurate human tracking and speed estimation. Conversely, our method only needs human positions, without considering the errors of human tracking and speed estimation, therefore, our approach is able to proficiently deal with various real-world scenarios.
To align the robot collision margin with the actual robot platform, we replaced the inflated circular margin with a rectangular one. This modification enabled us to match the collision margin of the simulated robot with that of the real robot platform. The collision margin has a length and width of 0.5 m and 0.3 m, respectively, identical to the collision margin of our real robot platform. The corresponding learning is illustrated in Fig. 5, which indicates that our method can deal with different robot collision margins.

B. Real Implementations
We deployed the policy learned from the simulation on a quadruped robotic platform equipped with a Velodyne VLP-16 LiDAR. The tests are shown in the attached videos. Although the human number changed from 1 to 3 and human motions are diversified, our simulated navigation policy could be directly transferred into real scenarios without any retraining or fine-tuning, which shows the potential of our method to model complex reciprocal human relations and navigate robots among pedestrians in the real world.

V. CONCLUSION
This letter presented an autonomous navigation system with a Dreamer-based motion planner. We let go of the assumption of fully known human states and only utilized the human position information. The human positions and robot location were projected onto a image. From the sequenced image observations, we extracted reciprocal relationships among pedestrians through representation learning. In addition, we created a state transition model using the extracted latent information to imagine episodes for reinforcement learning. Sufficient simulation ablations demonstrated that our method could learn from zero experience with high efficiency and outperformed state-of-theart algorithms. In addition, we leveraged the techniques of system identification, domain randomization, clustering, and LiDAR SLAM to enable sim-to-real transfer. Adequate real implementations illustrated the potential of our method to model complex reciprocal human relations and navigate the robot among pedestrians in the physical world. Our future study will focus on accurate human detection, precise robot localization, and universal navigation policy.