Multi-modal Neural Feature Fusion for Automatic Driving through Perception-aware Path Planning

Path planning is a significant and challenging task in the domain of automatic driving. Many applications, such as autonomous driving, robotic navigation and aerial object tracking in complex and changing urban road scenes, need accurate and robust path planning based on detecting obstacles in the forward direction. Traditional methods rely only on path search without considering environmental factors, so they cannot deal with complex and changeable environments. To address these problems, we propose a perception-aware multi-modal feature fusion approach that combines visual-inertial odometry (VIO) poses and semantic obstacles in the forward scene of the vehicle to plan driving paths. The proposed method takes environment awareness as a guide and combines it with a path search algorithm to realize path optimization in complex environments. The approach first uses a long short-term memory (LSTM) network to build a VIO that fuses visual and inertial data for pose estimation. To detect obstacles, the method uses a segmentation model with a lightweight structure to extract semantic 3D landmarks. Finally, a path search strategy combining the A* algorithm and visual information is proposed to plan driving paths for intelligent vehicles. We evaluate the proposed path planning method on simulated scenes and public datasets (KITTI and Cityscapes) using a microcontroller board (Jetson Xavier NX) installed on a small vehicle. We also compare against path planning that uses only a greedy or heuristic algorithm without visual information and show that our method is adequate for coping with different complex scenes.


I. INTRODUCTION
Visual perception is an essential part of automobile intelligence. As the autonomous driving industry continues to advance, the demands on perception-related technologies are also growing continuously, which undoubtedly pushes forward their rapid development. In recent years, LIDAR has been widely used in visual perception [1], [2], [3], [4] for automatic driving. Although LIDAR has advantages in accuracy and robustness, it is expensive, which undoubtedly increases the development and manufacturing costs of intelligent vehicles. Therefore, many researchers have begun to focus on visual perception methods that use only one or more low-cost cameras to perform automatic driving tasks [5], [6], [7], [8]. The related applications mainly include localization and navigation, lane detection, traffic light identification, pedestrian detection and building recognition. In general, visual perception is the foundation of path planning, and the key to path planning is helping moving vehicles adjust their poses to avoid detected obstacles in front of the vehicle. Visual perception includes not only pose estimation but also obstacle detection. In the early years, researchers obtained pose estimates for robots from a single camera or IMU, which offers many advantages over radar, such as low cost, light weight and easy installation [9], [10], [11], [12]. There are drawbacks, however, including limited detection accuracy, range and stability, and susceptibility to disturbance by the external environment. For example, when the light changes, the images captured by the camera are easily blurred. Additionally, the IMU is prone to producing a large drift error when the vehicle turns sharply. These drawbacks are not conducive to realizing automatic driving.
To address these problems, researchers have proposed alternative solutions, of which the most widely used is to fuse the camera with an IMU to compensate for the disadvantages of having only a single sensor. The existing approaches mainly include VIFTrack [13], Maplab [14], Iceba [15], and PIVO [16], which almost always use the fused data for recognition tasks such as feature tracking, automatic navigation or pose estimation. Comparing these approaches with single-sensor approaches shows that VIOs perform better on visual recognition tasks. However, these methods only cope with localization and navigation and are unable to cope with path planning for automatic driving. It is challenging for intelligent vehicles to drive autonomously in complex urban street scenes: they need the ability of scene classification as well as localization and navigation, and at the same time they need to detect the categories of objects ahead of them in order to develop an obstacle avoidance strategy based on a path planning algorithm. However, none of the existing methods address the problem of scene classification in terms of path planning in automatic driving; the proposed scene classification methods only focus on object recognition or local scene recognition rather than a path planning strategy [17], [18], [19]. To address these problems, a novel method combining VIO and scene information for path planning is proposed in this paper. In summary, the main contributions of this paper are as follows: (1) We propose a perception-aware path planning method that combines visual pose estimation, obstacle detection and a path search algorithm. (2) We propose to use a deep VIO model to output pose information.
(3) We construct a lightweight semantic segmentation network for obstacle detection. (4) The proposed method is evaluated on a public urban street scene benchmark and a simulated environment based on a Jetson Xavier NX installed on a small vehicle, following the evaluation standards of existing methods.

II. RELATED WORK
In this section, we discuss work closely related to our proposed method, covering the advances and limitations of existing methods as well as the problems our approach addresses and the advantages it offers.

A. POSE ESTIMATION
In computer vision, estimating pose is a challenging task for robots, especially with a weak GPS signal or without a GPS environment. This is mainly reflected in the directional adjustment of vehicles when driving. The most popular approach to address pose estimation is to combine visual and inertial information, which has many advantages over only using a single sensor, like a camera or an IMU. As we know, the camera is prone to inaccuracy in extreme environments, for example, weak illumination, heavy rain or dense smog.
The IMU also has the disadvantage that it accumulates drift error over time. Therefore, fusing visual and inertial data combines the complementary advantages of the two kinds of sensors when an intelligent vehicle navigates to a destination. For example, Zhu, A. et al. [20] proposed an algorithm that fuses a pure event-based tracking algorithm with an IMU to provide 6-DoF camera pose tracking. Alatise, M.B. et al. [21] proposed a low-cost and precise localization method for mobile robots that combines a 6-DoF IMU (including a three-axis accelerometer and a three-axis gyroscope) with a camera. Because of motion blur and latency, visual data from capturing cameras are incomplete, which presents a challenge for pose capture. To cope with this, some bio-inspired vision methods have been proposed. For example, Mueggler, E. et al. [22] proposed using a continuous-time representation to perform VIO with an event camera that could deal with the high temporal resolution and asynchronous characteristics of the sensor. Rebecq, H. et al. [23] presented a novel, accurate and tightly coupled VIO pipeline, which exhibited outstanding performance in estimating the camera's self-motion under challenging conditions. To track non-corner-shaped features, Bloesch, M. et al. [24] proposed using an iterated extended Kalman filter (IEKF) to tightly fuse inertial measurements with visual data from one or more cameras to form a robust VIO. With the rapid development of deep learning, some deep VIOs based on deep neural networks show many advantages over these traditional methods: they have the advantages of self-extraction and self-coding, and the extracted neural features are robust in extreme environments. In recent years, some deep VIOs, e.g., DeepFuse [25], DeepVIO [26], Vinet [27], and Self-VIO [28], [29], have been proposed for pose estimation and perform better than traditional methods.
In our work, we use two kinds of deep learning models (CNN and LSTM) to encode visual and IMU frames. Then, we fuse the two kinds of data from a camera and an IMU within the VIO for pose estimation.

B. SCENE SEGMENTATION
Scene segmentation is a fundamental task of environmental modelling in automatic driving. In an urban scene, the ability to classify every object (vehicles, buildings, trees, and pedestrians) around the vehicles is necessary for safe driving.
The key task for vehicles is to identify each obstacle that appears in front of the vehicle by pixel-level segmentation. The pixels of the high-resolution images captured by a camera mounted on the vehicle need to be classified into a set of semantic tags. Differently from other scenes, the scale variation of objects in the automatic driving scene is large, which brings great challenges to high-level feature representation; therefore, encoding multi-scale information correctly is imperative. Yang, M. et al. [30] proposed introducing "atrous" convolution to generate features with larger receptive fields without sacrificing spatial resolution. Luc, P. et al. [31] presented an autoregressive CNN architecture that learns to generate multiple frames iteratively. Due to the large difference between labeled training (source) data and real-world data, significantly decreased performance is observed that cannot be easily remedied. To overcome this problem, a novel UDA framework [32] based on an iterative self-training (ST) procedure was proposed, which is meant to address this difference through latent variable loss minimization. Most semantic segmentation methods rely heavily on pre-trained networks that were originally used to classify images as a whole. Although such networks show excellent recognition performance, these methods perform poorly in terms of localization accuracy. To address this problem, Pohlen, T. et al. [33] presented a novel architecture that has strong localization and recognition abilities without pre-training. Because of the extraordinary computational complexity involved, many research works have proposed using lightweight architectures that decrease the computational complexity by reducing the number of network layers. For example, Orsic, M. et al. [34] presented an alternative approach that achieves significantly better performance across a wide range of computing budgets.
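To make the "atrous" convolution idea concrete, the following is a minimal one-dimensional sketch (not the cited implementation): the kernel taps are spaced `dilation` samples apart, so the receptive field grows with the dilation rate while the number of weights and the sampling resolution stay unchanged. All names here are illustrative.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    # "atrous" convolution: taps are spaced `dilation` samples apart,
    # enlarging the receptive field without any down-sampling
    k = len(kernel)
    span = (k - 1) * dilation  # extent of the receptive field minus one
    return np.array([
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span)
    ])

x = np.arange(10, dtype=float)
kernel3 = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, kernel3, 1))  # receptive field of 3 samples
print(dilated_conv1d(x, kernel3, 2))  # receptive field of 5, same 3 weights
```

With dilation 1 this reduces to an ordinary convolution; doubling the dilation widens the receptive field without adding parameters, which is why such layers suit multi-scale street scenes.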
In our work, we also present a lightweight model with a bottleneck structure based on ResNet for semantic detection, which is end-to-end trainable.

C. PATH PLANNING
The task of scene perception is to help intelligent vehicles distinguish their surrounding environment. Additionally, the vehicle needs the ability to avoid obstacles in complex environments. In general, finding the shortest and most efficient path from the designated starting point to the final destination, while avoiding obstacles during automatic driving, is the primary goal of most research. In recent years, studies have focused on the path planning of unmanned ground vehicles. The goal is to find the shortest, most stable, most economic and safest path under the conditions of collision avoidance, motion boundaries and speed constraints in the presence of obstacles and water flow. For example, Ma, Y. et al. [35] treated this problem as a multi-objective nonlinear optimization problem with numerous constraints and proposed adopting an augmented multi-objective particle swarm optimization algorithm to find the solution. Broumi, S. et al. [36] applied the Dijkstra algorithm to solve the shortest path problem. However, while the traditional Dijkstra algorithm can find the shortest path, it skips over other paths with the same distance and cannot deal with the equidistant shortest-path problem. To address this, an improved Dijkstra algorithm was proposed to solve the problem of path planning in a rectangular environment, which can determine all equidistant shortest paths [37]. The problem of path planning can also be treated as an engineering optimization problem, in which the shortest collision-free path is taken as the criterion to optimize the driving path [38], [39], [40]. With the rapid development of machine learning, learning-based path planning shows many advantages. Wang, B. et al. [41] proposed a learning-based technique to address the problem of replanning when the robot encounters a conflict. Wen, S. et al. [42] proposed using a fully convolutional residual network to recognize obstacles and obtain depth images.
The obstacle avoidance path is planned by the "dueling DQN" algorithm in the robot's navigation, which exploits environmental spatial-temporal information. Zhou, X. et al. [43] proposed using deep reinforcement learning algorithms for USV and USV-formation path planning, which specifically focuses on safe obstacle avoidance in maritime scenes. In real-world scenarios, even considering an ideal situation with a known environment, workable maps for real applications are large. To deal with these problems, Orozco-Rosas, U. et al. [44], [45] proposed a hybrid path planning algorithm based on a membrane pseudo-bacterial potential field, which is able to reduce the computational cost and time. However, the above methods do not use scene information that could provide vehicles with a reference for actionable decisions; they have to rely on a greedy strategy, which makes the search path longer. Therefore, we propose a novel path planning method that combines prior visual information and the A* algorithm. Like Dijkstra, the A* algorithm is generally used to search for the shortest paths; similarly to BFS, it can guide itself with a heuristic function. Because of its many advantages, applications based on the A* algorithm have been proposed to deal with path problems [46], [47], [48]. In this paper, we also use the A* algorithm to plan the vehicle's moving path by combining the previous pose estimation and scene segmentation.

III. METHODOLOGY
In this paper, we propose a perception-aware path planning approach that combines multi-modal data and a path search algorithm. The proposed approach consists of four modules: feature encoding, pose estimation, obstacle detection and path planning, as shown in Fig. 1. The first module is mainly responsible for feature encoding of the original data (visual and IMU data). The second performs pose estimation via a deep VIO constructed with an LSTM. The third performs obstacle detection using a lightweight segmentation network with an encoder-decoder structure. Finally, the last performs path planning by combining scene perception and a path search algorithm to develop a safe obstacle avoidance strategy.

A. POSE ESTIMATION WITH A FUSION OF MULTI-MODAL NEURAL FEATURES
In this section, we encode an image and IMU frame by a lightweight CNN and LSTM and build a deep VIO to fuse two kinds of feature vectors that are used to estimate 6-DoF poses of the vehicle. After encoding, the image and IMU frames are translated into a visual feature vector and an inertial feature vector:

X_v = {x_v,t1, x_v,t2, ..., x_v,tk},  X_i = {x_i,t1, x_i,t2, ..., x_i,tk}

where X_v is the visual feature vector and X_i is the inertial feature vector, and x_v,tk and x_i,tk correspond to the two kinds of modal frames that appear at the same time t_k.
The two sensors produce data at different rates: the IMU frequency (100 Hz ~ 500 Hz) is about ten times that of the camera (10 Hz ~ 50 Hz). To cope with this problem, an LSTM is utilized to encode the inertial frames; the visual and inertial data are then fused together, which forms a new feature vector. Furthermore, we introduce the LSTM to build a VIO, which is used for pose estimation. Given an image-IMU sequence, a mapping relation between the original frames and the feature space is constructed by transforming the input sequences of images and IMU data into a pose vector:

x_p = {x, y, z, q_1, q_2, q_3, q_4} = X_v ⊕ X_i   (4)

where (x, y, z) is the 3D translation, (q_1, q_2, q_3, q_4) is the rotation quaternion, and ⊕ denotes the fusion of the visual and inertial feature vectors. In our work, we consider translation and rotation as motion on manifolds and decompose the camera's motion into a rotation motion and a translation motion in the vector space.
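The encoding-and-fusion step above can be sketched with toy stand-ins (not the actual network): a pooling-plus-projection function in place of the CNN, a bare recurrent tanh update in place of the LSTM consuming the roughly ten IMU samples that accompany each camera frame, and concatenation for the ⊕ fusion. All shapes, weights and function names here are illustrative assumptions.

```python
import numpy as np

def encode_image(img, w):
    # toy CNN stand-in: global average pool over H and W, then a projection
    pooled = img.mean(axis=(0, 1))             # (C,)
    return np.tanh(w @ pooled)                 # visual feature x_v

def encode_imu(imu_seq, wx, wh):
    # toy recurrent (LSTM-like) stand-in: one hidden-state update per sample
    h = np.zeros(wh.shape[0])
    for u in imu_seq:                          # imu_seq: (10, 6) accel+gyro
        h = np.tanh(wx @ u + wh @ h)
    return h                                   # inertial feature x_i

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64, 3))         # one camera frame
imu = rng.standard_normal((10, 6))             # ~10 IMU samples per frame
x_v = encode_image(img, rng.standard_normal((128, 3)))
x_i = encode_imu(imu, rng.standard_normal((64, 6)), rng.standard_normal((64, 64)))
fused = np.concatenate([x_v, x_i])             # the X_v ⊕ X_i fusion
print(fused.shape)
```

The recurrent encoder absorbs the rate mismatch: however many IMU samples arrive between two frames, they collapse into one fixed-size inertial vector, so the concatenation is always well defined.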
Then, the rotation motion and translation motion are represented as a rotation matrix and a translation vector by the special Euclidean group SE(3):

SE(3) = { T = [R, t; 0^T, 1] | R ∈ SO(3), t ∈ R^3 }   (5)

Every Lie group SE(3) has a matching Lie algebra se(3) that describes the local properties of the Lie group, represented as a vector φ defined for SE(3). We built an LSTM network to cope with the camera's motion on the manifold; the LSTM performs the mapping from the image data to the Lie algebra se(3). Every moment of the camera's motion has a matching element of SE(3), which forms a trajectory flow when recorded over the time sequence. The record of the camera's motion is shown in Fig. 2. Next, the fused feature vector is fed into three fully connected neural layers for pose regression. The fully connected layers regress the 3D translation and the 4D rotation (as a quaternion) and map the fused feature vector into the vehicle's pose vector x_p.

B. SEMANTIC 3D OBSTACLE DETECTION
Semantic segmentation is one of the key problems in computer vision. On the one hand, semantic segmentation is a high-level task that paves the way for an understanding of scenes; its importance as a core computer vision problem lies in an increasing number of applications that infer knowledge from images, mainly including self-driving vehicles, human-computer interaction and virtual reality. In recent years, with the popularity of deep learning, researchers have constructed ever deeper networks to address semantic segmentation problems; the most common is the convolutional neural network, which exceeds traditional methods in both accuracy and efficiency. On the other hand, semantic segmentation can predict pixel-level object categories. Generally, scene segmentation provides a comprehensive scene description for vehicles, including object categories, locations and shapes. In our work, we present a lightweight asymmetric semantic segmentation model consisting of three parts: an initial block, down-sampling and up-sampling, as shown in Fig. 3. By combining the advantages of ResNet, the proposed segmentation model has a strong ability to perform pixel-wise classification. In addition, to reduce the computing cost and improve efficiency, in reference to E-net [47] we reconstruct the network using many bottleneck layers to reduce the number of weights. The proposed network architecture is shown in Table 1: the encoder stacks bottleneck stages (e.g., Bottleneck-2.1 down-samples to 128×64×64, and Bottleneck-2.x, x = 2, ..., 9, alternates dilated 2, asymmetric 5 and dilated 4 convolutions at 128×64×64), and the decoder up-samples through 64×128×128 and 16×256×256 stages before a final full convolution outputs C×512×512 score maps. The network thus consists of two parts: an encoder (down-sampling) and a decoder (up-sampling). Each "Bottleneck" structure includes two 1×1 filters, which are used to reduce and then restore the feature dimension. In the down-sampling stage, the output feature resolution decreases by multiple ratios; in the up-sampling stage, it increases by multiple ratios. Additionally, scene segmentation produces a label for each pixel according to the categories appearing in a scene, as shown in Fig. 4. In our work, we select RReLU as the activation function and softmax as the object classifier; according to the previously computed semantic features, the classifier detects every object appearing in the scene.
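The per-pixel classification step can be sketched as follows, with the decoder's output replaced by random toy logits (the shapes and class count are illustrative, not those of the proposed network). The RReLU shown is the deterministic evaluation-mode form, which scales negative inputs by the mean of the slope range; softmax then turns each pixel's scores into a class distribution and argmax yields the label map.

```python
import numpy as np

def rrelu(x, lower=1/8, upper=1/3):
    # evaluation-mode RReLU: negatives scaled by the mean slope (lower+upper)/2
    a = (lower + upper) / 2
    return np.where(x >= 0, x, a * x)

def softmax(z, axis=-1):
    # numerically stable softmax along the class axis
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# toy per-pixel logits for a 4x4 image and C = 3 classes
rng = np.random.default_rng(1)
logits = rrelu(rng.standard_normal((4, 4, 3)))
probs = softmax(logits)            # per-pixel class distribution
labels = probs.argmax(axis=-1)     # pixel-level semantic label map
print(labels.shape)
```

In a real pipeline the logits come from the decoder's final C×512×512 full convolution, and the label map is what the planner consumes as semantic obstacle information.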

C. PATH PLANNING FOR AUTOMATIC DRIVING
Path planning is a key step towards automatic driving. Combining the previous scene segmentation, intelligent vehicles can detect obstacles distributed around the vehicle by identifying semantic information, and through path planning they can create an obstacle avoidance strategy. Combining the path search algorithm, the vehicle can optimize the selected path to obtain the best driving path; based on the previous pose estimation, it can also avoid obstacles by adjusting its driving poses. In our work, we learned from heuristic search methods used in games, combining the A* algorithm and prior visual detection to find a safe driving path. The main function of the path-finding strategy is to find the best path for obstacle avoidance and to minimize the cost (fuel, time, distance, equipment, and money). Compared with other algorithms, A* not only has high search accuracy but also a fast search speed when combined with prior visual information, which can meet the technical requirements of future self-driving vehicles. The path planning problem consists of the starting point, the end point, all possible paths between the two points, the intermediate nodes (intermediate states) and the search cost. In general, the path search uses a heuristic function, namely, the approximate cost from any node to the end point, to estimate the reward value quickly. The output is the optimal path from the start node to the end point, that is, the least costly one. A good heuristic function makes this search operation as efficient as possible or searches as many paths as possible. Therefore, the cost estimate from the initial state can be calculated as follows:

f(n) = g(n) + h(n),  n ∈ N   (11)

where g(n) is the actual cost from the start node to node n, h(n) is the heuristic estimate of the cost from node n to the end point, and N is the set of search nodes. The search settings and results are listed in Table 2.
The results are also shown in Fig. 5, which shows that the search time increases with the increase of obstacles under the unchanged search distance and changed search nodes. In addition, the increase of obstacles will raise the difficulty of path planning under the same starting point and target point conditions, which can also be shown from the continued increase in search time.
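The heuristic search described above can be sketched as plain A* on a 2D occupancy grid (a minimal illustration, not the proposed perception-aware planner, which additionally feeds semantic obstacle detections into the map). The cost f(n) = g(n) + h(n) uses Manhattan distance as h(n); the grid layout and function names are illustrative.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected occupancy grid; grid[r][c] == 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])
    h = lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1])  # Manhattan h(n)
    open_heap = [(h(start), 0, start)]        # entries are (f, g, node)
    came_from, g_best = {}, {start: 0}
    while open_heap:
        f, g, node = heapq.heappop(open_heap)
        if node == goal:                      # reconstruct path back to start
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (node[0] + dr, node[1] + dc)
            if 0 <= nb[0] < rows and 0 <= nb[1] < cols and grid[nb[0]][nb[1]] == 0:
                ng = g + 1                    # unit step cost g(n)
                if ng < g_best.get(nb, float("inf")):
                    g_best[nb] = ng
                    came_from[nb] = node
                    heapq.heappush(open_heap, (ng + h(nb), ng, nb))
    return None                               # no collision-free path exists

grid = [[0, 0, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 0, 0],
        [0, 1, 1, 1],
        [0, 0, 0, 0]]
path = astar(grid, (0, 0), (4, 3))
print(len(path) - 1)  # 13 steps around the two obstacle walls
```

Adding obstacles forces longer detours around the blocked rows, which mirrors the observation above that search time and difficulty grow with the number of obstacles between fixed start and target points.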

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we discuss the setup of the experiments, including the selection of datasets, the setting of experimental parameters and the layout of experimental scenes, and we also discuss and analyze the experimental results.

A. DATASETS (1) KITTI dataset
The KITTI dataset is used to assess the stereo and optical flow, visual odometry (VO), 3D object detection and 3D tracking performance of computer vision technologies in vehicle mounted environments. KITTI includes real image data from urban, rural and highway scenes. In the paper, we use only 11 sequences for visual evaluation, which have corresponding ground truth poses obtained by a GPS reader.

(2) Cityscapes
Cityscapes is a large-scale automatic driving dataset that contains a set of stereo video sequences recording street scenes from 50 different cities, with high-quality pixel-level annotation of 5K frames in addition to a larger set of 20K coarsely annotated frames. In our work, we use "leftImg8bit_trainvaltest" for performance evaluation.

B. POSE ESTIMATION WITH A VIO
Pose estimation needs to monitor the changes of driving direction, which consists of translation and rotation motions. Based on the trajectory of rotation, we can attempt to predict the vehicle's future trajectory to build some references for the control center of intelligent vehicles for path planning. In this paper, we train a deep learning-based VIO model to estimate the pose, as shown in Fig. 6. We use the Euler angles (yaw, pitch and roll) to describe the vehicle's rotation in the motion space, and we also use the three rotation angles separately.
Furthermore, we select the APE (absolute pose error) as the evaluation metric of pose estimation. The APE is often used to evaluate a vehicle's absolute trajectory error: each estimated pose is compared directly with the corresponding reference pose, and the statistics over the whole trajectory are then calculated, which is useful for testing the global consistency of the trajectory. It can be seen from Fig. 7 that the results of the proposed method on kitti_07 are better than those on the other sequences, with an APE in the range of 0 ~ 2, whereas the result on kitti_01 is poor, with an APE in the range of 3 ~ 17. On the whole, the APE of the 11 KITTI sequences is distributed mainly in the range of 0 ~ 7, which indicates that the proposed method reaches an advanced level. The poor result on kitti_01 is mainly due to the many sharp turns on this path, which cause instantaneous trajectory drift.
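The APE statistic described above can be sketched as follows: for each timestamp the relative error pose E_i = Q_i^-1 P_i between the ground-truth pose Q_i and the estimated pose P_i is formed, and the RMSE of the translation error magnitudes is reported. This is a minimal sketch of the common translation-part APE, with illustrative pure-translation poses; it is not the exact evaluation script used in the paper.

```python
import numpy as np

def ape_translation(gt_poses, est_poses):
    """Translation APE: RMSE of ||trans(inv(Q_i) @ P_i)|| over all timestamps.
    Poses are 4x4 homogeneous SE(3) matrices."""
    errs = []
    for Q, P in zip(gt_poses, est_poses):
        E = np.linalg.inv(Q) @ P                 # relative error pose E_i
        errs.append(np.linalg.norm(E[:3, 3]))    # translation error magnitude
    return float(np.sqrt(np.mean(np.square(errs))))

def se3(t):
    # helper: pure-translation homogeneous pose (identity rotation)
    T = np.eye(4)
    T[:3, 3] = t
    return T

gt  = [se3([0, 0, 0]), se3([1, 0, 0]), se3([2, 0, 0])]
est = [se3([0, 0, 0]), se3([1, 0.3, 0]), se3([2, 0.4, 0])]
print(round(ape_translation(gt, est), 3))  # 0.289
```

Because the error is taken against absolute reference poses rather than frame-to-frame increments, a single sharp-turn drift (as on kitti_01) inflates every subsequent APE value, which matches the spread seen in Fig. 7.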

C. SEMANTIC SEGMENTATION FOR OBSTACLE DETECTION
In our work, by learning from E-net [48], we reconstruct a segmentation network. Compared with E-net, the reconstructed network is deeper and has a symmetrical down-sampling and up-sampling structure, which to a large extent ensures a better scene segmentation effect. In addition, compared with other popular semantic segmentation networks, the proposed network has lightweight parameters (only 2.3M weights). Therefore, the proposed segmentation network can carry out image segmentation in real time on a small vehicle processor. The experiment runs on an NVIDIA Jetson Xavier NX, which has the same size as the Jetson Nano but better performance. In the segmentation process, the dataset is split into 166 samples for training and 500 samples for evaluation, the number of epochs is set to 3, and the iterations go up to 118,920. It can be seen from Fig. 8 (a) and (b) that as the number of iterations increases and the learning rate decays, the training loss decreases gradually until it approaches 0.
To evaluate the performance of the proposed method, we utilize PixAcc (pixel accuracy) and mIoU (mean intersection over union) as the evaluation metrics of segmentation. We also evaluate the performance of semantic segmentation by comparing the proposed method with other state-of-the-art methods, e.g., Fcn8s, Fcn16s, Fcn32s, E-net and Led-net. In our experiments, the front-end of each network model is the above model in turn, and the back-end takes VGG-16 as the skeleton. We adopt a sigmoid-parameterized activation function to encode the cost performance of segmentation, which comprehensively evaluates accuracy and computational cost. P_acc and P_mIoU are adopted as the evaluation metrics of cost performance considering accuracy and mIoU, where Acc and mIoU represent the accuracy and mean intersection over union, sigmoid represents the parameterized activation function, w and t represent the weight and the time cost of an iteration, and α and β represent dynamic parameters in the activation function. Fig. 9 and Fig. 10 show the results of accuracy and mIoU for the proposed method and other popular methods. It can be seen that the performance of the proposed method is better than that of Fcn8s, Fcn16s, Fcn32s and Led-net in terms of accuracy or mIoU, and only slightly lower than that of Fcn8s in some cases. The main reason is that, after training, the weight of our proposed network model is much smaller, approximately one thirtieth that of Fcn8s, Fcn16s, Fcn32s and Led-net, and the iteration speed is faster, although it is slightly lacking in raw performance.
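The two segmentation metrics above can be computed from a class confusion matrix: PixAcc is the fraction of correctly labeled pixels, and mIoU averages, over classes, the intersection of predicted and ground-truth pixels divided by their union. The following is a minimal sketch with toy 2×3 label maps; the shapes and class count are illustrative.

```python
import numpy as np

def seg_metrics(pred, gt, num_classes):
    # confusion matrix: rows = ground-truth class, cols = predicted class
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for g, p in zip(gt.ravel(), pred.ravel()):
        cm[g, p] += 1
    pix_acc = np.diag(cm).sum() / cm.sum()           # correctly labeled pixels
    inter = np.diag(cm)                              # per-class intersection
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter  # per-class union
    miou = np.mean(inter / np.maximum(union, 1))     # mean IoU over classes
    return pix_acc, miou

gt   = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
acc, miou = seg_metrics(pred, gt, 3)
print(round(acc, 3), round(miou, 3))  # 0.833 0.722
```

mIoU is typically lower than PixAcc because each class contributes equally regardless of how many pixels it covers, so small or rare obstacle classes weigh heavily in the average.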

D. PATH PLANNING FOR OBSTACLE AVOIDANCE
The path planning of vehicles needs to deal with the challenging problem of obtaining safe and effective driving routes for autonomous vehicles. In fact, path planning technology is currently a very hot research topic. What makes path planning so complex is that it covers all areas of autonomous driving technology, from the most basic brakes, to sensors that sense the environment, to localization and forecasting models. To verify the proposed method, we conduct experiments in simulated scenes. The experimental scene is flat, with a length and a width of 8 meters. We place some obstacles in this scene to simulate a real automatic driving experience, as shown in Fig. 11 (a), and take a small car equipped with a Jetson Xavier NX as the automatic driving vehicle. The distribution of obstacles and start-end points is shown in Fig. 11 (b) and Table 3, which present the basic settings as well as the path searching process, including the searching distance, searching times and searching nodes. Additionally, we run our model on the CPU and GPU, respectively, and compare the time cost of each processing stage, as shown in Table 4. It can be seen that in the stages of image pre-processing and visualization the CPU is faster than the GPU, while in the stage of network training the GPU is significantly faster than the CPU; in the stage of image post-processing, the two are similar in speed. In terms of total time, the GPU is faster than the CPU. Fig. 11 (c) presents the comparison between the path trajectory and the ground truth, from which it can be seen that under different starting point and end point settings, the real measurement data and the ground truth are basically consistent, which ensures safe and successful obstacle avoidance and guarantees that the vehicle reaches the target point quickly and efficiently.

E. PERFORMANCE COMPARISON WITH STATE-OF-THE-ART METHODS
We randomly generate a 7×7×7 (m) 3D map full of obstacles to evaluate the effectiveness of various existing path planning methods, which include the proposed perception-aware path planning method, A* [51], Dijkstra [52] and SPRM [53], as shown in Fig. 12. On the same map, the proposed perception-aware path planning method gives the smallest search distance and the fewest steps while achieving a higher success rate. However, because the "scene perception + A*" planning method is adopted, the total path planning time is composed of the image processing time and the path searching time; therefore, the path planning time is slightly longer than that of the other algorithms. Fig. 13 presents example results of the proposed perception-aware path planning method in an outdoor environment.

V. CONCLUSIONS
In this paper, a multi-modal perception-aware method is proposed to solve path planning problems in complex environments. The proposed method carries out the path planning task based on scene perception (including pose estimation and scene segmentation) and path searching. We first evaluate the proposed pose estimation model by comparing it with the ground truth on the KITTI dataset using the APE metric. The analysis showed that the proposed model is consistent with the ground truth in terms of the pose trajectory.
Then, we evaluate the proposed lightweight segmentation network by comparing it with state-of-the-art methods based on mIoU. According to the comparison results on accuracy and mIoU, we conclude that the proposed lightweight model obtains a higher obstacle detection accuracy. Finally, we carry out an experiment in a simulated environment full of obstacles and compare the pipeline with existing path planning methods that do not use scene perception. The results show that the perception-aware path planning method is better than methods that do not use scene information in terms of path search time, search distance, search steps and success rate of obstacle avoidance.
Although the above advantages are demonstrated, some problems remain unsolved. For example, we have only verified the method in a simulated (static) environment, not in a real, complex one. The proposed path planning method is also highly dependent on the model design, so vehicles lack the ability to learn actions independently; under extreme driving conditions, the proposed method may therefore be unstable, especially in a dynamic environment. In the future, we will focus on the algorithmic design of pose learning for intelligent vehicles and the optimal design of engineering problems [54], [55], [56].
In addition, to deal with path planning in dynamic environments, a deep reinforcement learning (DRL) model will be presented in future work. In the process of outputting the pose and path, the vehicles can learn this behavior online and constantly optimize the path, which will make the vehicles more intelligent and autonomous over time, meeting the needs of future driverless vehicles in dynamic scenes.