Exploring Contextual Representation and Multi-Modality for End-to-End Autonomous Driving

Learning contextual and spatial environmental representations enhances an autonomous vehicle's hazard anticipation and decision-making in complex scenarios. Recent perception systems enhance spatial understanding with sensor fusion but often lack full environmental context. Humans, when driving, naturally employ neural maps that integrate factors such as historical data, situational subtleties, and behavioral predictions of other road users to form a rich contextual understanding of their surroundings. This neural map-based comprehension is integral to making informed decisions on the road. In contrast, even with their significant advancements, autonomous systems have yet to fully harness this depth of human-like contextual understanding. Motivated by this, our work draws inspiration from human driving patterns and seeks to formalize the sensor fusion approach within an end-to-end autonomous driving framework. We introduce a framework that integrates three cameras (left, right, and center) to emulate the human field of view, coupled with top-down bird's-eye-view semantic data to enhance contextual representation. The sensor data is fused and encoded using a self-attention mechanism, leading to an auto-regressive waypoint prediction module. We treat feature representation as a sequential problem, employing a vision transformer to distill the contextual interplay between sensor modalities. The efficacy of the proposed method is experimentally evaluated in both open- and closed-loop settings. Our method achieves a displacement error of 0.67 m in open-loop settings, surpassing current methods by 6.9% on the nuScenes dataset. In closed-loop evaluations on CARLA's Town05 Long and Longest6 benchmarks, the proposed method enhances driving performance and route completion, and reduces infractions.


Introduction
The ecosystem of autonomous driving involves the perception and planning modules complementing each other for a smooth course of action (Yurtsever et al., 2020). To this end, two approaches, modular (Azam et al., 2020) and end-to-end autonomous driving (Xiao et al., 2020; Khan et al., 2022), have been adopted in academia and industry as possible solutions for the perception and planning modules. Although both approaches have their pros and cons, at the level of scalability, end-to-end autonomous driving is more promising than the modular approach. End-to-end autonomous driving bridges perception and planning in a unified framework, providing a differentiable stack for learning driving policies (Schwarting et al., 2018).
Extracting meaningful spatial and temporal scene representations is vital for learning better driving policies in end-to-end autonomous driving. In the literature, various sensor modalities, such as single-camera systems and lidar, have been employed to capture environmental details. Techniques have been developed to extract both spatial and temporal information from these modalities (Huang et al., 2020; Behl et al., 2020). Among these, sensor fusion techniques stand out, integrating data from multiple sensors to achieve a comprehensive understanding. However, obtaining a complete understanding of a complex and dynamic environment, which involves interactions between multiple dynamic agents and features across different views or modalities, necessitates a global context. This global context aids in establishing a contextual representation across the different modalities.
This work explores the formulation of driving policies by integrating multi-camera perspectives with top-down bird's-eye-view (BEV) semantic maps to provide a holistic view of the environment. This integrative technique aims to refine the perception capabilities within an end-to-end autonomous driving framework. Our framework first extracts and fuses features from the input data, then applies a transformer network to correlate these features with the subsequent control commands for the vehicle. The autonomous vehicle's waypoints are informed by the insights gained from the transformer. To validate our proposed method, we have conducted evaluations in both open-loop and closed-loop contexts, employing the nuScenes dataset and the Town05 Long and Longest6 CARLA benchmarks, respectively. The results indicate that our approach outperforms existing state-of-the-art methods in the precision of driving policy prediction.
The main contributions of this work are: 1. Designing a framework that integrates spatial perception through RGB cameras with a top-down bird's-eye-view (BEV) semantic map for contextual mapping. This dual approach mimics human-like perception by combining immediate visual data with a global understanding of the environment, enhancing the autonomous system's ability to navigate complex scenarios. 2. Developing a transformer-based encoder to sequence the spatial and contextual features, leading to an improved feature representation for learning driving policies.
The rest of the paper is organized as follows: Section 2 covers the literature review. Section 3 presents the proposed framework, and Section 4 presents the experimentation, analysis, and results. Finally, Section 5 concludes the paper with possible directions for future work.

Related Work
2.1 Multi-modal End-to-end Learning Frameworks for Autonomous Driving Learning optimal trajectories involves a better representation of the environment that includes its spatial, temporal, and contextual information. Different multi-modal end-to-end driving methods have been developed in the literature to improve driving performance. These multi-modal methods use cameras, Lidar, HD maps, or sensor fusion between these information modalities. Xiao et al. (2020) used sensor fusion between RGB cameras and depth information to investigate the use of multi-modal data compared to a single modality for end-to-end autonomous driving. Some works have focused on semantics and depth for determining explicit intermediate representations of the environment and their effect on autonomous driving (Behl et al., 2020; Zhou et al., 2019). In addition, some works, for instance NMP (Zeng et al., 2019), have used Lidar and HD maps first to generate intermediate 3D detections of the actors in the future and then learn a cost volume for choosing the best trajectory. Lidar and camera fusion is extensively used for perception and for obtaining driving policies. Sobh et al. (2018) used Lidar and image fusion by processing each sensor modality stream in a separate branch and then fusing the resulting features. Further, they applied semantic segmentation and Lidar post-processing (polar grid mapping) to increase the method's robustness. Similarly, Prakash et al. (2021) fused Lidar and camera data at multiple levels through self-attention for learning driving policies. In addition, some methods have adopted sensor fusion between camera and semantic maps (Natan and Miura, 2022) for learning an end-to-end driving policy. Several studies have investigated the application of knowledge distillation techniques to learn driving policies. In this approach, a privileged agent is initially trained with access to comprehensive information, such as maps, navigational data, and images. Subsequently, this privileged agent is employed to train a sensorimotor agent, which only has access to image data (Chen et al., 2020; Zhang et al., 2023). Furthermore, improving the decoder in an encoder-decoder architecture has also been explored by Jia et al. (2023). All these methods have used sensor fusion techniques to acquire spatial or temporal information about the environment but lack contextual information in terms of BEV semantic maps. In the proposed work, we have opted for BEV semantic maps and incorporated them with the camera streams to answer whether the inclusion of BEV semantic maps improves driving performance.

BEV Representation in End-to-End Autonomous Driving
Representing the environment in a BEV benefits the planning and control tasks, as it circumvents issues like occlusion and scale distortion and also provides a contextual representation of the environment. In this context, some works focus on generating the BEV representation; for instance, ST-P3 leverages spatial-temporal learning by designing an egocentric-aligned BEV representation and finally uses that representation for perception, planning, and control (Hu et al., 2022). Hu et al. (2023) designed an end-to-end planning framework for autonomous driving. This framework's perception and prediction modules are structured as transformer decoders, with task queries acting as the interface between the two nodes. An attention-based planner is used to sample future waypoints by considering the past node's data. Following the same approach, Jiang et al. (2023) used a vectorized representation for end-to-end autonomous driving. They adopted a BEV encoder for BEV feature extraction, combined with map and agent queries in a transformer network for environment representation, and then a planning transformer for predicting trajectories. In addition, Chitta et al. (2021) proposed neural attention fields for waypoint prediction. All these methods use a BEV representation, similar to our work, but are more focused on how to construct the BEV representation from input images; in our work, we focus on how to use the BEV features for learning the policy rather than building the BEV from input images and then using it for planning. The experimental analysis shows the efficacy of our proposed method against state-of-the-art methods, illustrating the effectiveness of using the BEV representation for learning driving policies in both open- and closed-loop settings.

Transformers in End-to-End Autonomous Driving
Initially used for natural language processing tasks (Vaswani et al., 2017), transformers have been widely employed for learning meaningful representations in vision applications (Dosovitskiy et al., 2020; Carion et al., 2020). The transformer's self-attention module enhances the learning of sequential data globally and improves feature representation. Prakash et al. (2021) employed a transformer to combine intermediate feature representations from RGB images and Lidar data. Shao et al. (2022) proposed an encoder-decoder transformer network for predicting waypoints, using fusion between different viewpoint cameras and Lidar to learn the global context for comprehensive scene understanding. Huang et al. (2022) designed a transformer-based neural prediction framework that considers social interactions between different agents and generates possible trajectories for autonomous vehicles. Dong et al. (2021) determine the driving direction from visual features acquired from images using a novel framework consisting of a visual transformer. The driving directions are decoded for human interpretability to provide insight into the learned features of the framework. Finally, Li et al. (2020) consider social interaction between agents on the road and forecast their future motion. The spatial-temporal dependencies are captured using a recurrent neural network combined with a transformer encoder.

Method
An overview of the proposed method is depicted in Figure 1. The method encompasses a perception module and a waypoint prediction module, which collectively facilitate the learning of end-to-end driving policies. The perception module extracts features from the input sensor modalities and then passes them to the waypoint prediction module to generate future waypoints/trajectories. The following sections detail the problem formulation and model architecture, explaining the perception and waypoint prediction modules.

Problem Formulation
In this work, an end-to-end learning approach is adopted for the point-to-point navigation problem, where the objective of the trained agent is to safely reach the goal point by learning a driving policy π* that imitates the expert policy π. The learned policy completes the given route while avoiding obstacles and complying with traffic rules. In the closed-loop settings, we have opted for the CARLA simulator to collect the expert dataset in a supervised learning approach. Similarly, to use expert data in the open-loop settings, we have used the nuScenes dataset. Suppose a dataset D = {(X^j, Y^j)}_{j=1}^{d} of size d is collected, consisting of high-dimensional observation vectors X from the sensor modalities along with the corresponding expert trajectory vectors Y. The expert trajectories are defined in the vehicle's local coordinate space as a set of 2D waypoints transformed into BEV space, that is, Y = {y_t = (u_t, v_t)}_{t=1}^{T}, where u_t and v_t are the position information in the horizontal and vertical directions, and T corresponds to the future horizon of the waypoints. The objective is to learn the policy π* with the collected dataset D in a supervised learning framework with the loss function L:

π* = arg min_π E_{(X,Y)∼D} [ L(π(X), Y) ]    (1)

In this urban setting, the high-dimensional observations include the center, right, and left cameras and the top-down BEV semantic data.
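The objective in (1) can be made concrete with a small sketch. The snippet below, a minimal illustration rather than the paper's implementation, computes an L1 imitation loss between a predicted and an expert waypoint trajectory; the per-waypoint reduction (sum over coordinates, mean over the horizon) is one plausible choice, not stated in the text.

```python
import numpy as np

def imitation_loss(pred_wp, expert_wp):
    """L1 imitation loss between predicted and expert waypoints.

    pred_wp, expert_wp: arrays of shape (T, 2) holding (u_t, v_t)
    positions over the prediction horizon T.
    """
    return np.abs(pred_wp - expert_wp).sum(axis=1).mean()

# Toy example: predictions off by 0.1 m in u at every step.
expert = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.0, 4.0]])
pred = expert + np.array([0.1, 0.0])
loss = imitation_loss(pred, expert)
```

Training then amounts to minimizing this loss over the dataset D with respect to the policy parameters.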

Model Architecture
This section explains the model architecture of the proposed method, which is composed of two main modules: i) the perception module and ii) the waypoint prediction module. The perception module accepts the sensor modality information for feature extraction, which is then passed to the waypoint prediction module for the prediction of waypoints using an auto-regressive model.

Perception Module
The perception module of the proposed method includes the backbone architecture and the transformer encoder network. The following subsections explain the details of the perception module.
Backbone: In this module, we build a spatio-temporal representation of the environment. The input RGB images from all three views, that is, center (I_c ∈ R^{3×H×W}), right (I_r ∈ R^{3×H×W}), and left (I_l ∈ R^{3×H×W}), with width W and height H, are processed by the backbone network for feature extraction. A ResNet architecture is employed for feature extraction, as it is efficient for feature representation. Similarly, the BEV semantic maps (M ∈ R^{H×W}) are passed to a ResNet network for feature extraction. In the closed-loop settings, these BEV semantic maps are collected through the simulator, whereas in the open-loop settings, the BEV semantic maps are generated from the nuScenes dataset, as explained in Section 3.4.2. In our settings, a ResNet model pre-trained on the ImageNet dataset is utilized for generating the low-resolution feature maps f ∈ R^{C×H×W} for each sensor modality. The last layer of the ResNet model generates feature maps of dimension (B, 512, 8, 8) for each sensor modality, where B corresponds to the batch size. The resolution of these feature maps is reduced to (B, 512, 1, 1) by average pooling and flattened to a 512-dimensional vector. To input the features into the transformer encoder, a projection layer transforms each 512-dimensional vector into a 400-dimensional vector. Finally, the 400-dimensional vectors from each sensor modality, i.e., the center f′_c, right f′_r, and left f′_l cameras and the top-down BEV semantic maps f′_M, are concatenated to give a final 1600-dimensional vector, which is then reshaped to (B, 1, 40, 40) to be utilized by the transformer encoder. The expressions in (2) summarize the computation of the feature maps in the backbone network for each sensor modality and the top-down semantic maps, where H, W = 256 for each input. These expressions illustrate the feature extraction for a single batch; in our experiments, we used a batch size of 64 to train the proposed method.
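The pooling, projection, and fusion pipeline above can be sketched shape-for-shape as follows. This is an illustrative numpy mock-up, not the trained network: the learned 512-to-400 projection is stood in for by a fixed random matrix.

```python
import numpy as np

def fuse_modalities(feature_maps):
    """Fuse per-modality backbone features into a (B, 1, 40, 40) grid.

    feature_maps: list of 4 arrays of shape (B, 512, 8, 8), one per
    modality (center/right/left camera and the BEV semantic map).
    """
    rng = np.random.default_rng(0)
    W = rng.standard_normal((512, 400)) * 0.01  # stand-in for the learned projection
    fused = []
    for f in feature_maps:
        pooled = f.mean(axis=(2, 3))   # global average pool -> (B, 512)
        fused.append(pooled @ W)       # project -> (B, 400)
    out = np.concatenate(fused, axis=1)  # concat 4 modalities -> (B, 1600)
    return out.reshape(-1, 1, 40, 40)    # token grid for the transformer encoder

B = 2
maps = [np.random.rand(B, 512, 8, 8) for _ in range(4)]
tokens = fuse_modalities(maps)  # shape (2, 1, 40, 40)
```

The reshape to a 40 × 40 grid lets the fused vector be treated as a single-channel "image" that the vision transformer can split into patches.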
Transformer Encoder: In this work, a transformer encoder, specifically a vision transformer, is employed to learn the contextual relationship between the features and to generalize toward a better feature representation. The resulting feature map f ∈ R^{1×H×W} is fed to the transformer encoder by flattening it into patches f_p ∈ R^{N×(P²C)}, where H and W correspond to the resolution of the input features from the backbone network, C is the number of channels, (P, P) is the size of each patch, and N = HW/P² denotes the number of patches and also the input sequence length. In addition, a learnable position embedding, a trainable parameter with the same dimension as the input sequence, is added to the input sequence so that the network infers the spatial dependencies between different tokens at training time. A velocity embedding, encoding the current velocity through a linear layer, is also added along the C dimension of the input sequence. The input sequence, positional embeddings E_pos, and velocity embeddings E_vel are summed element-wise, and the encoder layers are applied as

z_0 = f_p + E_pos + E_vel
z′_ℓ = MSA(LN(z_{ℓ−1})) + z_{ℓ−1},   z_ℓ = MLP(LN(z′_ℓ)) + z′_ℓ

where MSA corresponds to multi-head self-attention, MLP is a multi-layer perceptron, LN is layer normalization, and ℓ indexes the encoder layers. The multi-head attention helps generate a rich feature representation of the input sensor modalities, which in turn helps learn a better contextual representation. The self-attention is formulated as

Attention(Q, K, V) = softmax(QKᵀ/√D) V

where Q, K, and V are the query, key, and value vectors obtained from the input sequence through the weight matrix W, and D corresponds to the dimension of the query and key vectors. The output features from the MSA have the same dimensionality as the input features. The transformer encoder applies this attention multiple times throughout the architecture. The final output features from the transformer encoder are then summed along the sequence dimension to produce a 16-dimensional vector carrying the contextual representation of the features from all the sensor modalities. This resulting 16-dimensional feature vector is injected into the waypoint prediction module to predict the waypoints.
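The patching arithmetic above can be checked with a small sketch: with the fused (1, 40, 40) grid and a patch size of P = 4, one obtains N = 40·40/4² = 100 tokens, each of dimension P²C = 16, matching the 16-dimensional contextual vector after reduction over the sequence. This is an illustration of the tokenization only, not the attention layers.

```python
import numpy as np

def patchify(f, P=4):
    """Split a (C, H, W) feature grid into N = H*W/P^2 flattened patches."""
    C, H, W = f.shape
    f = f.reshape(C, H // P, P, W // P, P)
    f = f.transpose(1, 3, 0, 2, 4)       # (H/P, W/P, C, P, P)
    return f.reshape(-1, C * P * P)      # (N, P*P*C)

f = np.random.rand(1, 40, 40)            # fused single-channel feature grid
tokens = patchify(f)                     # (100, 16) token sequence
context = tokens.mean(axis=0)            # reduce over the sequence -> 16-dim vector
```

In the full model the reduction happens after the encoder layers; the mean here simply demonstrates that the sequence collapses to the 16-dimensional vector consumed by the waypoint decoder.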

Waypoint Prediction Module
The waypoint prediction module acts as a decoder, predicting future waypoints from the information encoded by the transformer encoder. The resulting 16-dimensional vector is passed through an MLP consisting of two hidden layers with 256 and 128 units, respectively, to output a 64-dimensional vector. The MLP upsamples the vector dimension from 16 to 64, an experimental heuristic that produces better results in terms of waypoint prediction. We employ an auto-regressive GRU model to predict the next waypoints, where the 64-dimensional feature vector initializes the hidden state of the GRU. The GRU-based auto-regressive model takes the current position and the goal location as input, which helps the network focus on the relevant context in the hidden states to predict the next waypoints. In the closed-loop settings, the goal locations are GPS points registered in the same ego-vehicle coordinate frame and are given as input to the GRU rather than the encoder, because of the collinear BEV space between the predicted waypoints and the goal locations. In the open-loop settings, however, high-level commands such as go forward, turn right, and turn left are passed as input to the GRU for waypoint prediction.
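The auto-regressive rollout can be sketched as follows. This is a minimal numpy mock-up under stated assumptions: the weights are random stand-ins for the learned parameters, the input at each step is the current waypoint concatenated with the goal location, and a linear head (here `Wo`, a hypothetical name) maps the hidden state to an offset added to the current waypoint.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class WaypointGRU:
    """Auto-regressive waypoint decoder sketch (random stand-in weights)."""
    def __init__(self, hidden=64, inp=4, seed=0):
        rng, s = np.random.default_rng(seed), 0.1
        self.Wz, self.Uz = rng.standard_normal((inp, hidden)) * s, rng.standard_normal((hidden, hidden)) * s
        self.Wr, self.Ur = rng.standard_normal((inp, hidden)) * s, rng.standard_normal((hidden, hidden)) * s
        self.Wh, self.Uh = rng.standard_normal((inp, hidden)) * s, rng.standard_normal((hidden, hidden)) * s
        self.Wo = rng.standard_normal((hidden, 2)) * s  # hidden -> (du, dv) offset

    def step(self, x, h):
        # Standard GRU cell: update gate z, reset gate r, candidate state.
        z = sigmoid(x @ self.Wz + h @ self.Uz)
        r = sigmoid(x @ self.Wr + h @ self.Ur)
        h_tilde = np.tanh(x @ self.Wh + (r * h) @ self.Uh)
        return (1 - z) * h + z * h_tilde

    def rollout(self, feature, goal, T=4):
        # The 64-dim fused feature initializes the hidden state.
        h, wp, out = feature, np.zeros(2), []
        for _ in range(T):
            h = self.step(np.concatenate([wp, goal]), h)
            wp = wp + h @ self.Wo    # offset from current waypoint
            out.append(wp)
        return np.stack(out)         # (T, 2) predicted waypoints

gru = WaypointGRU()
feat = np.random.default_rng(1).standard_normal(64)
wps = gru.rollout(feat, goal=np.array([10.0, 0.0]))  # shape (4, 2)
```

Predicting offsets from the previous waypoint, rather than absolute positions, is a common design choice for such decoders; the paper does not specify which variant it uses.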
In the open-loop settings, we evaluate the predicted trajectory against the ground-truth trajectory without using a controller. For the closed-loop settings, the predicted waypoints are passed to the control module of the CARLA simulator to generate steer, throttle, and brake values. Two PID controllers, for lateral and longitudinal control, are used in this context. The longitudinal controller takes the weighted average of the magnitudes of the vectors between waypoints of consecutive time steps, whereas the lateral controller takes their orientation. For the control settings, we use the settings suggested by the reference codebase benchmarked on CARLA.
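The waypoints-to-control step can be illustrated with a simple sketch. The gains below are illustrative placeholders, not the tuned values from the benchmarked codebase, and an unweighted mean of the inter-waypoint magnitudes is used where the paper specifies a weighted average.

```python
import numpy as np

class PID:
    """Minimal PID controller; gains are illustrative, not the tuned values."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev = 0.0, 0.0

    def __call__(self, error, dt=0.05):
        self.integral += error * dt
        deriv = (error - self.prev) / dt
        self.prev = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

def waypoints_to_targets(wps):
    """Derive speed/heading targets from consecutive predicted waypoints."""
    vecs = np.diff(wps, axis=0)                          # vectors between steps
    target_speed = np.linalg.norm(vecs, axis=1).mean()   # longitudinal target
    target_heading = np.arctan2(vecs[0, 1], vecs[0, 0])  # lateral target
    return target_speed, target_heading

# Four waypoints 1 m apart, straight ahead.
wps = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
speed_t, head_t = waypoints_to_targets(wps)
throttle = PID(1.0, 0.1, 0.0)(speed_t - 0.0)  # current speed assumed 0
steer = PID(1.2, 0.0, 0.1)(head_t - 0.0)      # current heading assumed 0
```

For a straight trajectory the heading error is zero, so the steering output is zero while the throttle controller pushes the speed toward the inter-waypoint magnitude.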

Experiments
This section explains the evaluation of the proposed method in both open-loop and closed-loop settings. The nuScenes dataset is utilized for the open-loop evaluation, whereas the CARLA simulator is used for the closed-loop evaluation.

Open-loop Experiments on nuScenes Dataset:
The nuScenes dataset contains 1k diverse scenes comprising different weather and traffic conditions. Each scene is 20 seconds long and contains 40 frames, corresponding to a total of 40k samples in the dataset. The dataset is recorded using a camera rig of 6 cameras on the ego-vehicle, giving a full 360° view of the environment. The dataset includes the calibrated intrinsics K and extrinsics (R, t) for each camera view at every time step. The proposed method utilizes the center, right, and left camera views. Since the nuScenes dataset does not provide any top-down BEV semantic representations, the BEV semantic representation is generated using the ego-vehicle poses and the intrinsic and extrinsic calibration data of the camera views.

Input Representations: For the nuScenes dataset, the input images from the center, right, and left camera views are first cropped and resized to 256 × 256 from the original resolution of 900 × 1600. Contrary to the camera views and ego-vehicle future position data, nuScenes does not provide top-down BEV semantic maps. Since the proposed method requires BEV semantic maps, we generate those map labels using the ego poses and camera calibration data. To do so, for each camera, the camera's intrinsic matrix K and extrinsic camera-to-vehicle calibration matrix T^V_C are obtained from the nuScenes dataset. Similarly, the vehicle-to-world ego-pose transformation matrix T^W_V, which includes the rotation matrix R and translation vector t, is retrieved from the dataset. The BEV transformation matrix T^BEV_W is obtained by composing T^V_C and T^W_V. A 3D point or object is first represented as a homogeneous vector P_w = [X_w, Y_w, Z_w, 1]^T and is then projected onto the BEV space by taking the product between the BEV transformation matrix T^BEV_W and P_w, resulting in the 2D BEV coordinate P_bev. Finally, the segmentation map assigns class labels to the BEV pixel coordinates, giving the final BEV semantic map M. In our settings, we keep the resolution of this BEV semantic map at 256 × 256.

Output Representations: The proposed method predicts the future trajectory Y of the ego-vehicle in the ego-vehicle coordinate frame. In the open-loop settings, the future trajectory Y is represented as waypoints that include position information. In our experiments, by default, the horizon T = 2.0 s is set for predicting the future trajectory, taking the past 1.0 s as context.

Evaluation Metrics: For the evaluation of the proposed method, the Euclidean distance (L2 error) between the expert trajectory and the predicted trajectory is used. Mathematically, the L2 error is defined by Eq. (5):

L2 = (1/n) Σ_{i=1}^{n} ||T_e^i − T_p^i||_2    (5)

where T_e and T_p correspond to the expert and predicted trajectories, respectively. Each trajectory consists of n points in a d-dimensional space.
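The world-to-BEV projection described above can be sketched with homogeneous coordinates. The metric extent of the BEV map and the pixel-mapping convention below are assumptions for illustration; the composed transform `T_bev_w` would in practice be built from the nuScenes calibration and ego-pose matrices.

```python
import numpy as np

def project_to_bev(P_w, T_bev_w, resolution=256, meters=64.0):
    """Project a homogeneous 3D world point into BEV pixel coordinates.

    P_w: homogeneous world point [X, Y, Z, 1].
    T_bev_w: 4x4 world->BEV transform (composed from the camera
    extrinsics and ego pose as described above).
    `resolution` and `meters` (map extent) are assumed values.
    """
    p = T_bev_w @ P_w                        # transform into the BEV frame
    x, y = p[0], p[1]                        # drop height after the transform
    px = int((x / meters + 0.5) * resolution)
    py = int((y / meters + 0.5) * resolution)
    return px, py

# Identity transform for illustration: a point 16 m ahead, 0 m lateral.
P_w = np.array([16.0, 0.0, 0.0, 1.0])
px, py = project_to_bev(P_w, np.eye(4))      # -> (192, 128)
```

The semantic class of each projected point is then painted at (px, py) to build the 256 × 256 BEV semantic map.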
4.2 Closed-loop Experiments on the CARLA Dataset: In this work, the CARLA 0.9.10 simulator is used to create a dataset for training and evaluation. Table 1 details the settings utilized in generating the training dataset to create a more varied simulation environment. For generating the dataset, an expert policy with privileged information from the simulation is rolled out, and the data is saved at 2 FPS. The dataset includes left, right, and center camera RGB images, top-down semantic map information, the corresponding expert trajectory, speed data, and vehicular controls. The trajectory includes 2D waypoints transformed into BEV space in the vehicle's local coordinates, whereas the steering, throttle, and brake data are incorporated into the vehicular control data at the time of recording. Inspired by the configurations of Prakash et al. (2021), we gather the data by giving a set of predefined routes to the expert driving the ego-vehicle. The routes are defined by GPS coordinates provided by the global planner together with high-level navigational commands (e.g., turn right, follow the lane). We generate around 60 hours of data, comprising 200K frames.
Input Representation: The proposed method utilizes two modalities: RGB cameras (left, center, and right) and semantic maps. The three RGB cameras provide a complete field of view that mimics the human field of view. The semantic maps are converted to a BEV representation that contains ground-truth lane information and the location and status of traffic lights, vehicles, and pedestrians in the vicinity of the ego-vehicle. The top-down semantic maps are cropped to a resolution of 256 × 256 pixels. For all three cameras, to account for radial distortion, the images are cropped to 256 × 256 from the original camera resolution of 400 × 300 pixels at the time of extracting the data.
Output Representation: For the point-to-point navigation task, the proposed method predicts the future trajectory Y of the ego-vehicle in the vehicle coordinate space. The future trajectory Y is represented by a sequence of 2D waypoints, Y = {y_t = (u_t, v_t)}_{t=1}^{T}, where u_t and v_t are the position information in the horizontal and vertical directions, respectively. In the experimental analysis, we utilize T = 4 waypoints. Evaluation Metrics: The efficacy of the proposed method is evaluated using the following metrics defined by the CARLA driving benchmarks.
Route Completion (RC): the percentage of the route distance R_j completed by the agent on route j, averaged across the N routes:

RC = (1/N) Σ_{j=1}^{N} R_j    (6)

The RC is reduced if the agent drives off the specified route for some percentage of the route; this reduction is applied as a multiplier (1 − % off-route distance).

Infraction Multiplier (IM): as shown in (7), defined as the geometric series of infraction penalty coefficients p_i for every infraction encountered by the agent along the route. The agent starts with an ideal base score of 1.0, which is reduced by a penalty coefficient for every infraction. The penalty coefficient p_i for each infraction is predefined: if the agent collides with a pedestrian (p_pedestrian), the penalty is 0.50; with other vehicles (p_vehicles), 0.60; 0.65 for collision with the static layout (p_stat); and 0.70 if the agent runs a red light (p_red). With the penalty coefficients PC = {p_pedestrian, p_vehicles, p_stat, p_red},

IM_j = Π_{i ∈ PC} p_i^{(# infractions of type i)}    (7)

Driving Score (DS): computed by taking the product of the percentage of the route completed by the agent R_j and the infraction multiplier IM_j of route j, averaged over the number of routes N_r. A higher driving score corresponds to a better model. Mathematically,

DS = (1/N_r) Σ_{j=1}^{N_r} R_j · IM_j    (8)

Note that if the ego-vehicle deviates from route j by more than 30 meters, or takes no action for 180 seconds, the evaluation on route j is stopped to save computation cost and the next route is selected for evaluation.
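The three metrics compose as follows; this sketch uses the penalty coefficients stated above and illustrates Eq. (8) on two hypothetical routes.

```python
PENALTIES = {"pedestrian": 0.50, "vehicle": 0.60, "static": 0.65, "red_light": 0.70}

def infraction_multiplier(infractions):
    """Geometric product of penalty coefficients, starting from a base of 1.0."""
    score = 1.0
    for kind in infractions:
        score *= PENALTIES[kind]
    return score

def driving_score(route_completions, infractions_per_route):
    """Mean of RC_j * IM_j over routes (Eq. 8); RC given as fractions of 1."""
    terms = [rc * infraction_multiplier(inf)
             for rc, inf in zip(route_completions, infractions_per_route)]
    return sum(terms) / len(terms)

# Two routes: one clean and fully completed, one 80% completed with a
# vehicle collision and a red-light infraction.
ds = driving_score([1.0, 0.8], [[], ["vehicle", "red_light"]])
# Route 2: 0.8 * (0.6 * 0.7) = 0.336, so DS = (1.0 + 0.336) / 2 = 0.668
```

The multiplicative penalties mean that a single pedestrian collision halves a route's contribution, which is why the infraction score dominates the driving score on infraction-heavy routes.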

Training Details
The proposed method is trained using the dataset collected from the CARLA simulator by rolling out the expert model, and also on the nuScenes dataset. In addition, we use a ResNet model pre-trained on the ImageNet dataset to extract features in the backbone network for each sensor modality. In training the proposed network, we add augmentations such as rotation and noise injection to the training data, along with adjusting the waypoint labels accordingly. For the transformer encoder, we use a patch size of 4, which gives a 16-dimensional feature embedding. We train the proposed method using the PyTorch library on an RTX 3090 with 24 GB of GPU memory for a total of 100 epochs. In training, we use a batch size of 64 and an initial learning rate of 10^-4, which is reduced by a factor of 10 after every 20 epochs. The L1 loss function is used for training the proposed method. Let y_t^{gt} represent the ground-truth waypoints from the expert for time step t; then the loss function is

L = Σ_{t=1}^{T} ||y_t − y_t^{gt}||_1

An AdamW optimizer is used in training, with the weight decay set to 0.01 and the beta values set to the PyTorch defaults of 0.9 and 0.999 (Yao et al., 2021).
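The step learning-rate schedule described above reduces to a one-line rule; the sketch below shows the values it produces over the 100-epoch run.

```python
def learning_rate(epoch, base_lr=1e-4, drop_every=20, factor=0.1):
    """Step schedule: divide the learning rate by 10 every 20 epochs,
    starting from the initial rate of 1e-4."""
    return base_lr * factor ** (epoch // drop_every)

# Learning rate at selected epochs: 1e-4, 1e-4, 1e-5, 1e-6.
lrs = [learning_rate(e) for e in (0, 19, 20, 40)]
```

The same behaviour is available in PyTorch as `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)`.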

Results
Open-loop Experimental Results on nuScenes: The proposed method is evaluated on the L2 metric against the state-of-the-art methods for the quantitative analysis, as illustrated in Table 2. In our experiments, we deactivate the ego-status information in the open-loop settings and fix the planning horizon to T = 3.0 s to make a fair comparison with the state-of-the-art methods. Since the L2 error corresponds to the displacement error in meters between the predicted and ground-truth trajectories, the lower the displacement error, the better the model. The proposed method illustrates better performance compared to the state-of-the-art methods. The comparative analysis uses camera-centric and Lidar-based end-to-end learning methods for trajectory prediction. For instance, NMP uses Lidar and HD maps for predicting future trajectories, giving an L2 error of 3.18 m; the NMP model is only evaluated at the planning horizon of 3.0 s. Similarly, the FF method predicts the future trajectory based on free-space estimation, with an L2 error of 2.54 m at the planning horizon of 3.0 s and an average L2 error of 1.43 m. The proposed method shows a lower L2 error at the planning horizon of 3.0 s and on average compared to the NMP and FF methods. Similar to our work, the baseline methods that follow the BEV representation are ST-P3, UniAD, and VAD-Base. The L2 errors for ST-P3, UniAD, and VAD-Base are 2.90 m, 1.65 m, and 1.05 m, respectively, at the planning horizon of 3.0 s, whereas the proposed method has an L2 error of 1.01 m at the same planning horizon, outperforming ST-P3, UniAD, and VAD-Base by 89.5%, 38.8%, and 3.8%, respectively. Similarly, on average, the proposed method shows a lower L2 error than the state-of-the-art methods.
Table 4 shows the results of the proposed method alongside other state-of-the-art methods on the Longest6 benchmark in closed-loop settings. The proposed method achieves a driving score of 67.43 ± 2.3, a route completion of 80.54 ± 1.5, and an infraction score of 0.81 ± 0.05 on the Longest6 benchmark, outperforming the other state-of-the-art methods on these evaluation metrics in the closed-loop settings. Figures 3 and 4 illustrate the proposed method's qualitative results on the Town05 and Longest6 benchmarks in various driving scenarios. The driving policy learned through the proposed method is displayed moving straight, stopping at a traffic light, and making left and right turns. These results demonstrate that the driving policy learned using the proposed method shows promising behavior and complements the quantitative comparison with the other state-of-the-art baseline methods.

Conclusion
In this work, we explore the use of contextual information for learning driving policies in an end-to-end manner for autonomous driving. Drawing inspiration from the human neural-map representation of the environment, we employ three RGB cameras coupled with a top-down semantic map to achieve a holistic understanding of the surroundings. This environmental representation is channeled through a self-attention-based perception module and subsequently processed by a GRU-based waypoint prediction module for generating the waypoints. The proposed method is experimentally evaluated in both open-loop and closed-loop settings, illustrating better performance than state-of-the-art methods.
Building on this foundation, there are avenues for further exploration. While our current framework primarily relies on RGB cameras and a semantic map for environmental perception, future research could benefit from incorporating additional sensors, such as radar and LiDAR, to enhance the perception module. An intriguing area of investigation remains how to refine the contextual representation of the environment for driving policy predictions, especially when integrated with neural-network-based controllers. This presents a promising direction for advancing the capabilities of autonomous driving systems.

Figure 1 :
Figure 1: The architecture of the proposed method, which comprises two modules: the perception block and the waypoint prediction block. The perception module extracts features from the three input RGB cameras (center, left, right) and the top-down semantic maps. These extracted features are then embedded with the velocity information and utilized by the transformer encoder. The encoded features are passed to the GRU-based waypoint prediction module for the generation of the next waypoints. (Best viewed in color)

Figure 2 :
Figure 2: Qualitative results for the proposed method in different driving conditions using nuScenes dataset in open-loop evaluation.

Figure 3 :
Figure 3: Qualitative results for the proposed method in different driving conditions on Town05 Long benchmark.

Figure 4 :
Figure 4: Qualitative results for the proposed method in different driving conditions on Longest6 benchmark.

Table 1 :
Dataset generation details using the CARLA simulator for the proposed method

Table 2 :
Quantitative comparison between the proposed method and the state-of-the-art baseline methods in open-loop settings using the nuScenes dataset.A lower L2 error indicates better performance.

Table 3 :
Comparison of proposed method with state-of-the-art methods on Town05 Long benchmark in terms of driving score (DS), route completion (RC) and infraction score (IS).* indicates the respective method reports the score on normal all weather conditions, and † corresponds to adversarial all weather conditions.

Table 4 :
Comparison of proposed method with state-of-the-art methods on Longest6 benchmark in terms of driving score (DS), route completion (RC) and infraction score (IS).