BEV-Radar: Bidirectional Radar-Camera Fusion for 3D Object Detection

Abstract: Exploring millimeter-wave radar data as a complement to RGB images for improving 3D object detection has become an emerging trend in autonomous driving systems. However, existing radar-camera fusion methods are highly dependent on prior camera detection results, which renders the overall performance unsatisfactory. In this paper, we propose a bidirectional fusion scheme in the bird's-eye view (BEV) that is independent of prior camera detection results. Leveraging features from both modalities, our method designs a bidirectional attention-based fusion strategy. Specifically, following BEV-based 3D detection methods, our method engages a bidirectional transformer to embed information from both modalities and enforces local spatial relationships through subsequent convolution blocks. After the features are embedded, the BEV features are decoded in the 3D object prediction head. We evaluate our method on the nuScenes dataset, achieving 48.2 mAP and 57.6 NDS. The results show considerable improvements over the camera-only baseline, especially in terms of velocity prediction. The code is available at https://github.com/Etah0409/BEV-Radar


Introduction
The perception system in autonomous driving is usually equipped with different types of sensors. Complementary multi-modal sensors avoid unexpected risks but introduce new challenges for sensor fusion. Recent works have focused on visual sensors [1], which typically provide dense and redundant information. However, visual sensors are usually not stable enough under adverse weather conditions (e.g., rain, snow, and fog). In addition to their high cost, the fusion of visual sensors alone cannot fully sustain the perception system in variable autonomous scenarios, which require robustness.
Aside from lidar and cameras, radar has also been widely used in autonomous scenes for speed measurement and auxiliary location prediction, but rarely for visual tasks due to its physical nature. While stability and penetration benefit from its physical properties, sparse returns, noisy features, and the lack of vertical information are crucial problems of frequently used automotive radar. Signals scattered randomly among vehicles, buildings, and obstacles suffer from highly specular reflectivity and multi-path effects. While the complementary characteristics of camera and radar are attractive, the fusion strategy faces several challenges. First, mm-wave radar returns projected onto the image view carry only direction and range; they provide no vertical information, which leads to bias when projected onto the camera view. Moreover, the image cannot rely merely on the projected radar depth, as multi-path effects produce inaccurate radar detections.
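The missing-height ambiguity can be seen with a toy pinhole projection. The intrinsic matrix `K` below is hypothetical; the point is only that a radar return fixing lateral offset and depth still projects to a different pixel row for every assumed height:

```python
import numpy as np

# Hypothetical pinhole intrinsics (focal length 800 px, principal point (320, 240)).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(p_cam):
    """Project a 3D camera-frame point (x: right, y: down, z: forward) to pixels."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

# A radar return fixes lateral offset and depth, but not height:
# each assumed height h lands on a different image row.
x, depth = 2.0, 10.0
rows = [project(np.array([x, h, depth]))[1] for h in (-1.0, 0.0, 1.0)]
```

Running this, the three assumed heights spread across three different rows, which is exactly the projection bias described above.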
Compared to the richer and more accurate information provided by visual sensors, aligning features between the camera and the radar is a challenging problem. Without vertical information, some methods [2,3] rectify the vertical direction in the front view after projecting radar points onto image planes. Higher-performing methods leverage first-stage proposals from the camera and then construct a soft association between objects and features according to the extrinsic matrix, as shown in Figure 1. Instead of association methods, transforming both features to the BEV can greatly relieve the problem, concerning two key points: a more compatible decoupled fusion strategy for radar data, and a better promotion for both modalities.
Inspired by BEV fusion methods [4,5], we implement BEV-Radar, an end-to-end fusion approach for radar and camera that can be conveniently applied to other BEV camera baselines. Before fusion, radar encoders perform pillar extraction and tensor compaction. BEV-Radar focuses on inserting dense radar tensors into the BEV image features generated by the camera baseline. Bidirectionally, radar features and image features are promoted to their respective decoders through cross-attention. Despite the simplicity of the basic idea, the evaluation on the nuScenes dataset yields outstanding results on the 3D object detection benchmark. It achieves an improvement over the camera-only baselines and performs well even compared to other radar-camera fusion studies. Moreover, fulfilling the original intention of the experiment, radar fusion behaves stably with a +10% mAP and +15% NDS boost in adverse scenes.
In conclusion, we make the following contributions:
I. We construct an end-to-end BEV framework for radar-camera fusion. Instead of relying on first-stage detection results provided by the camera, this integral network constitutes a portable and robust design that does not depend strongly on the camera.
II. We propose a novel bidirectional fusion strategy that, compared to vanilla cross-attention, is better suited to multimodal features with spatial relationships. It performs effectively despite the large diversity between radar and camera.
III. We achieve competitive camera-radar 3D detection performance on nuScenes. Compared to a single modality, we alleviate the hard problem of velocity prediction, which is nontrivial in autonomous driving.

Related Work
Camera-only 3D Detection Monocular 3D detection requires estimating 3D bounding boxes from a monocular camera. The key question is how to regress depth information on the 2D view. Earlier works rely on 2D detection networks with additional sub-networks for 3D projection [6,7]. Several works have attempted to convert RGB information into 3D representations, such as Pseudo-Lidar [8,9] and OFT [10]. Several studies [11] introduce keypoint detection for centers and use 2D object detection predictions as a regression auxiliary. In recent works, camera-only methods directly predict results in 3D space or on BEV features [5,12,13]. They operate directly on the BEV features transformed from the front view according to calibration.

Camera-fusion 3D Detection
The key point of association-based modality fusion methods is to find the interrelated spatial relationships among multi-modal sensors. In recent years, fusion approaches have mainly focused on lidars and cameras. Some earlier works [14,15] map the data from multiple views into unified types such as image or BEV. PointPainting [1] creatively proposes painting segmentation information from images onto the point cloud. Due to the sensitivity to adverse weather conditions, MVDNet [16] first designs a fused area-wise network for radar and lidar in a simulated foggy environment. Motivated by the cost of lidar, [17] studies the improvement of fusion on tiny objects with camera and radar, and [18] introduces the Transformer for feature-level fusion. However, 2D convolution over projected radar points incurs useless computation and does not take the sparsity of the radar into account. Restricted to the front view, spatial relationships between different modalities rely on the results predicted in the first stage. By transforming features from their respective views to a unified BEV, BEVFusion [4] predicts depth probabilities for image features and projects the pseudo-3D features to the BEV based on their extrinsic parameters. TransFusion [19] compresses camera features along the vertical axis to initialize the guiding query, and aligns first-stage results back to image planes.

Approach
In this work, we present BEV-Radar, a radar-camera fusion framework based on camera-only 3D object detection. As shown in Figure 2, given a set of multi-view images and sparse radar points as inputs, we extract the respective BEV features separately and then decode them using bidirectional attention modules inserted as fusion decoders, called BSF (Bidirectional Spatial Fusion). Instead of simple cross-attention, BSF performs better fusion for both modality features and aligns features from different domains effectively.
In the following subsections, we first review the preliminaries for related tasks and then elaborate on the implementation details of the BSF.

Generation of BEV Features
Traditional sensor fusion operates on separate views, so the perspective front view and BEV must be aligned via the actual pixel-to-pixel spatial relation. However, even with high-precision extrinsic calibration, projected radar points deviate from their true positions due to the absence of vertical information. Moreover, this pixel-to-pixel spatial alignment is not tight enough due to the geometric distortion of the camera and the sparsity of the 3D points. Therefore, a unified BEV representation instead of a geometric association is crucial for sensor fusion.

Transformers for 2D Detection
ViT [20] (Vision Transformer) proposes patching images with positional encoding instead of 2D convolution, which makes progress on image features based on the original NLP transformer [21]. The original attention mechanism is formulated as follows. Given a query embedding $Q$, key embedding $K$, and value embedding $V$, these inputs are computed in a single-head attention layer as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimension of the key embedding. As for the prediction decoder, DETR-style [22] transformers [13,23,24] are widely used for detection tasks, reformulating detection as predicting a set of matched bounding boxes. Thus the usual 3D regression problem is transformed into a bipartite matching problem and the Non-Maximum Suppression (NMS) algorithm is no longer needed.
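The single-head attention layer above can be sketched in a few lines of NumPy (a toy illustration, not the paper's implementation):

```python
import numpy as np

def single_head_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                # (Nq, Nk)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ v                             # (Nq, d_v)

# toy example: 4 query tokens attend over 6 key/value tokens
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))
k = rng.normal(size=(6, 16))
v = rng.normal(size=(6, 32))
out = single_head_attention(q, k, v)  # shape (4, 32)
```

Each output row is a convex combination of the value rows, weighted by query-key similarity.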

BEV Unified Representation
In this part, we detail the BEV representations of the two sensors. Transforming raw features extracted from their original data formats to the BEV is nontrivial for alignment.
To Camera: Following BEVDet [5], the BEV camera baseline predicts the depth of multi-view image semantic features from the backbone and feature pyramid network, and then transforms all features into a unified BEV grid space using the associated extrinsic matrix. The baseline thus forms a BEV camera feature map, downsampled 8× from the original size, where H and W describe the size of the BEV map. BEV image features provide a global representation for multi-view transformations.
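The depth-based lift used by this family of methods can be illustrated as an outer product between a per-pixel depth distribution and the image features; the dimensions below are illustrative, not the paper's:

```python
import numpy as np

# Toy lift step: each image pixel carries a categorical distribution over D
# depth bins; its outer product with the C-dim pixel feature yields pseudo-3D
# frustum features that are later splatted onto the BEV grid.
D, C, H, W = 8, 4, 2, 3
feats = np.random.rand(C, H, W)                   # image semantic features
depth_logits = np.random.rand(D, H, W)
depth_prob = np.exp(depth_logits)
depth_prob /= depth_prob.sum(0, keepdims=True)    # softmax over depth bins

# outer product: (D, 1, H, W) * (1, C, H, W) -> (D, C, H, W)
frustum = depth_prob[:, None] * feats[None]
```

Because the depth distribution sums to 1 per pixel, summing the frustum over the depth axis recovers the original image feature, i.e. the lift only redistributes each feature along its camera ray.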

To Radar: The radar data format is completely different from the camera's, similar to lidar but sparser, with about 300 points per 6 frames. To avoid overly sparse inputs, a sequence of points is accumulated into a tensor $R \in \mathbb{R}^{N \times d \times X \times Y}$, where X and Y denote spatial coordinates, d denotes attributes including velocity, and N is the size of the point set. In the absence of vertical information, pillar [25] feature extraction considerably alleviates the computation needed for sparse radar data to traverse the BEV plane. Naturally, the unified BEV radar features are formed after a linear transformation.
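A minimal stand-in for this radar-to-BEV step, mean-pooling point attributes per grid cell instead of a learned pillar encoder (the grid size and spatial extent below are assumptions):

```python
import numpy as np

def scatter_radar_to_bev(points, grid=(128, 128), extent=50.0):
    """Scatter sparse radar points onto a BEV grid by mean-pooling their
    attributes per cell (a simplified stand-in for pillar feature extraction).
    points: (N, 2 + d) array of (x, y, attr_1..attr_d)."""
    x, y, attrs = points[:, 0], points[:, 1], points[:, 2:]
    X, Y = grid
    d = attrs.shape[1]
    # map metric coordinates [-extent, extent) to grid indices
    ix = np.clip(((x + extent) / (2 * extent) * X).astype(int), 0, X - 1)
    iy = np.clip(((y + extent) / (2 * extent) * Y).astype(int), 0, Y - 1)
    bev = np.zeros((d, X, Y))
    count = np.zeros((X, Y))
    for i in range(len(points)):
        bev[:, ix[i], iy[i]] += attrs[i]
        count[ix[i], iy[i]] += 1
    return bev / np.maximum(count, 1)

# ~300 accumulated radar points with (x, y, vx, vy, rcs) columns
pts = np.random.uniform(-50, 50, size=(300, 5))
bev = scatter_radar_to_bev(pts)  # dense (3, 128, 128) pseudo image
```

Most cells stay zero, which is exactly the sparsity the bidirectional fusion module has to cope with.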

Bidirectional BEV Alignment
Traditional sensor fusion first concatenates individual features directly and then uses attention or convolution blocks to extract features from the different modalities and align them according to their spatial relationships. However, for BEV radar and image features, sparsity makes it nontrivial to align the two modalities spatially alone, so we need to generalize each sparse feature. In this section, we introduce a module consisting of cross-attention and convolution blocks that progressively embeds the duplex features into each other, which results in better alignment.
Specifically, a block consists of two parts: an interaction module that lets each feature communicate with the other, and a convolution-based fusion operation. As shown in Figure 2, the fusion part is divided into N equal blocks, and a positional embedding operation is applied before fusion. For the camera branch in the $i$-th block, the camera BEV feature map serves as the query while the radar BEV feature map is used as key and value, and vice versa for the radar branch. We use deformable cross-attention [24] to remedy the computational cost caused by the sparsity of BEV features, which can be formulated as:

$$\mathrm{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m\, x(p_q + \Delta p_{mqk}) \right],$$

where $z_q$ is the query feature at reference point $p_q$, $x$ is the key/value feature map, $M$ is the number of attention heads, $K$ is the number of sampled points, and $A_{mqk}$ and $\Delta p_{mqk}$ are the predicted attention weights and sampling offsets. Different from the vanilla NLP transformer, spatial information locating objects is vital for detection tasks. Designed for 2D structures, convolution kernels are better at extracting local spatial correlations than 1D attention.
The attended camera feature is reshaped back into image form and sent to convolution blocks, then patched again before the next block. At the same time, a transform block for the radar feature remains synchronized with the camera branch, and both are returned separately as the next inputs. In this way, the stacked blocks increase the mutual fitness of the two features, while the bidirectional design progressively aligns the feature domains. In each block, convolution layers are required to extract local spatial relations; see Sec. 3.2.2 for a related verification.
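The block structure described above can be sketched as follows. Plain dot-product attention and a 3×3 box filter stand in for the paper's deformable attention and learned convolutions; the shapes and block count are illustrative:

```python
import numpy as np

def attend(q, k, v):
    # plain cross-attention stand-in for the paper's deformable attention
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = np.exp(s - s.max(-1, keepdims=True))
    return (s / s.sum(-1, keepdims=True)) @ v

def smooth(tokens, H, W):
    # stand-in for the convolution stage: 3x3 box filter over the BEV map
    m = tokens.reshape(H, W, -1)
    out = np.zeros_like(m)
    n = np.zeros((H, W, 1))
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ys, xs = slice(max(dy, 0), H + min(dy, 0)), slice(max(dx, 0), W + min(dx, 0))
            yt, xt = slice(max(-dy, 0), H + min(-dy, 0)), slice(max(-dx, 0), W + min(-dx, 0))
            out[yt, xt] += m[ys, xs]
            n[yt, xt] += 1
    return (out / n).reshape(-1, m.shape[-1])

def bsf_block(cam, rad, H, W):
    # bidirectional step: each modality queries the other, then a local
    # spatial (convolution-like) pass enforces neighbourhood structure
    cam2 = smooth(cam + attend(cam, rad, rad), H, W)
    rad2 = smooth(rad + attend(rad, cam, cam), H, W)
    return cam2, rad2

H, W, C = 8, 8, 16
cam = np.random.randn(H * W, C)  # flattened camera BEV tokens
rad = np.random.randn(H * W, C)  # flattened radar BEV tokens
for _ in range(3):               # N stacked blocks
    cam, rad = bsf_block(cam, rad, H, W)
```

Note that both branches are updated from the pre-block features, so neither modality dominates the other within a block.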

Prediction Heads and Losses
The fused BEV features are fed to the 3D object detection prediction heads. Following TransFusion [19], we simply use the class-embedding heatmap transformed from the fusion features as query initialization to predict the centers of all objects in each scene. A vanilla transformer is used as the decoder, with the DETR-style [22] predictions matched through the Hungarian algorithm [26], and we set the regularized matching loss as a weighted sum of classification, regression, and IoU terms:

$$\mathcal{L} = \lambda_{\mathrm{cls}}\mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{reg}}\mathcal{L}_{\mathrm{reg}} + \lambda_{\mathrm{iou}}\mathcal{L}_{\mathrm{iou}},$$

where $\lambda_{\mathrm{cls}}$, $\lambda_{\mathrm{reg}}$, and $\lambda_{\mathrm{iou}}$ are the coefficients and $\mathcal{L}_{\mathrm{cls}}$, $\mathcal{L}_{\mathrm{reg}}$, and $\mathcal{L}_{\mathrm{iou}}$ are the individual loss functions.
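The matching step can be illustrated with SciPy's Hungarian solver. The distance-only cost and the coefficient values below are simplifications for illustration, not the paper's exact settings (its cost also involves classification and IoU terms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Bipartite matching of 4 predicted centers to 3 ground-truth centers,
# DETR-style; cost here is BEV center distance only, for brevity.
preds = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 5.0], [20.0, 0.0]])
gts   = np.array([[0.5, 0.2], [5.2, 4.9], [19.5, 0.3]])
cost = np.linalg.norm(preds[:, None] - gts[None], axis=-1)  # (4, 3)
row, col = linear_sum_assignment(cost)  # Hungarian algorithm: pred i <-> gt j

# Hypothetical loss coefficients; the real values are tuned hyperparameters.
lam_cls, lam_reg, lam_iou = 1.0, 0.25, 0.25
def total_loss(l_cls, l_reg, l_iou):
    # L = lam_cls * L_cls + lam_reg * L_reg + lam_iou * L_iou
    return lam_cls * l_cls + lam_reg * l_reg + lam_iou * l_iou
```

Here prediction 1 is left unmatched and is supervised as background, which is why no NMS is needed at inference: each ground truth claims exactly one prediction.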

Implementation Details
Training This end-to-end work is implemented on the open-source MMDetection3D [27] in PyTorch. For nuScenes, following M²BEV [28], we use a model pre-trained on nuImages with a Swin Transformer as the camera backbone. BEVDet [5] is chosen as our image BEV baseline, and its settings are kept the same. Considering the synchronization of real systems, we accumulate a sequence of radar points over 6 frames to resolve their sparsity and eliminate the effect of ego-motion. Pillars [25] are generated from radar points and scattered onto the BEV grid as a pseudo image. The pillar size is set to (0.1m, 0.1m, 8m). We adopt random flip and rotation strategies and the CBGS dataset configuration for data augmentation. Optimization uses AdamW with the official weight decay and a learning rate of 0.0001 on 4 NVIDIA RTX A6000 GPUs.
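The multi-sweep accumulation with ego-motion compensation can be sketched in 2D, assuming SE(2) ego poses are available for each sweep (nuScenes actually provides full SE(3) poses; the helper and pose names here are illustrative):

```python
import numpy as np

def accumulate_radar(frames, ego_poses):
    """Accumulate several radar sweeps into the latest ego frame.
    frames: list of (N_i, 2) xy point arrays, one per sweep;
    ego_poses: list of 3x3 SE(2) matrices mapping each sweep frame
    to a common global frame (last entry = current frame)."""
    ref_inv = np.linalg.inv(ego_poses[-1])  # global -> current ego frame
    merged = []
    for p, pose in zip(frames, ego_poses):
        T = ref_inv @ pose                   # sweep frame -> current ego frame
        xy1 = np.concatenate([p, np.ones((len(p), 1))], axis=1)
        merged.append((xy1 @ T.T)[:, :2])
    return np.concatenate(merged)

# two sweeps; the ego moved 2 m along x between them
T0 = np.eye(3)
T1 = np.eye(3); T1[:2, 2] = [2.0, 0.0]
pts = accumulate_radar([np.array([[1.0, 0.0]]), np.array([[0.5, 0.0]])], [T0, T1])
```

The older point shifts backward by the ego displacement, so accumulated sweeps stay spatially consistent before pillarization.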
Testing Owing to the transformer detection head, we use all outputs through a bipartite matching algorithm instead of traditional Non-Maximum Suppression (NMS). Evaluation metrics are mean average precision (mAP) across 10 foreground classes and the nuScenes detection score (NDS), without test-time augmentation.

Experiments
In this section, we first compare our work with other state-of-the-art methods and fusion strategies on nuScenes. We then conduct ablation studies on the designed modules. Moreover, we design extra variable weather conditions, comparing against our visual baseline to show robustness in severe scenes. The visual baseline is BEVDet-Tiny for camera-only detection. As a separable fusion module, our framework can be easily extended to support more sensors.
nuScenes Dataset The nuScenes dataset is a large-scale outdoor autonomous-driving dataset for various 3D visual tasks. For our work in particular, nuScenes is the only such dataset whose sensors include radar together with 3D ground truth. It consists of 700, 150, and 150 scenes for training, validation, and testing, respectively. In particular, 6 calibrated multi-view cameras cover the surrounding horizontal FOV with overlap, while 5 calibrated radars are evenly distributed around the ego vehicle. The model is evaluated according to mean Average Precision (mAP) and the nuScenes detection score (NDS) for 10 common classes. Instead of 3D IoU, AP in nuScenes is defined by the BEV center distance, with thresholds of 0.5m, 1m, 2m, and 4m across the 10 classes, and NDS is a weighted sum of mAP and other attribute metrics, including velocity.
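The center-distance matching rule can be illustrated with a greedy toy version; the real nuScenes evaluation additionally sorts predictions by confidence and handles classes and attribute errors:

```python
import numpy as np

def distance_tp(pred_centers, gt_centers, threshold):
    """Greedy true-positive count under a BEV center-distance threshold,
    mimicking the nuScenes matching rule (simplified: no scores/classes)."""
    matched = np.zeros(len(gt_centers), dtype=bool)
    tp = 0
    for p in pred_centers:
        d = np.linalg.norm(gt_centers - p, axis=1)
        d[matched] = np.inf            # each ground truth matches at most once
        j = int(np.argmin(d))
        if d[j] <= threshold:
            matched[j] = True
            tp += 1
    return tp

preds = np.array([[0.0, 0.0], [10.0, 10.0], [30.0, 0.0]])
gts   = np.array([[0.6, 0.0], [10.5, 9.5], [50.0, 50.0]])
# the four distance thresholds used by nuScenes: 0.5 m, 1 m, 2 m, 4 m
tps = [distance_tp(preds, gts, t) for t in (0.5, 1.0, 2.0, 4.0)]
```

Averaging precision over these four thresholds (and over recall levels) is what makes nuScenes AP a center-distance metric rather than an IoU metric.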

Main Results
Our model is submitted to the nuScenes evaluation server and achieves competitive performance on its metrics. As shown in Table 1, our model outperforms camera-only baselines by 17% mAP with a 10.2 FPS inference speed. In contrast to other camera-radar fusion methods, BEV-Radar achieves a 7% mAP and 5% NDS boost on the test split. Moreover, the mAVE metric is notable given the complementary fusion motivation: accurate velocity prediction is quite difficult for single-frame camera methods, so radar fusion should remedy this burden for camera-only detection. Our results show a large improvement in velocity prediction of 14%-24% compared to other radar fusion models. In addition, to evaluate robustness, one of the complementary attributes provided by radar fusion, we design several experiments under different weather conditions, as shown in the next section. Table 2 shows the per-class mAP comparison with methods of various modalities on the val set. With the similar performance of BEVDet-Tiny and CenterNet as camera-only baselines, our fusion work achieves significant progress for radar fusion. Considering the velocity attribute, with dynamic types (i.e., Car, Truck, Bus, Pedestrian, Motorcycle, and Bicycle) and static classes (i.e., Barrier and Traffic Cone) divided separately, the results also show a gap (14%-20%), since the valid radar points of dynamic objects are distinguished from background interference. More precisely, radar fusion performs better for metallic and large objects (i.e., Car, Truck, and Bus) than for non-metallic and small objects (i.e., Pedestrian and Bicycle), which is caused by the RCS (Radar Cross Section). Accuracy for non-metallic and static objects is more camera-dependent, with radar less helpful. In particular, although Trailers and Construction Vehicles belong to the large, metallic, and dynamic classes, it is hard to achieve comparably satisfactory performance because they occur infrequently in the nuScenes dataset.

Ablation Studies
We conduct several ablation studies on the nuScenes validation set to verify the effectiveness of the proposed components. Table 3 reports the improvement of BF or BSF under different settings of model depth and training epochs. a) The first row shows the result of the model that uses concatenation followed by convolution layers instead of BF or BSF as the fusion module. b)-c) When the concatenation is replaced with BF, mAP and NDS improve together as the number of layers increases. To compare BF and BSF, d) shows the performance gap with respect to the spatial policy. Specifically, we set the number of fusion blocks to 3, showing that the model performs better in this setting. a)-e) show that our BSF blocks work well for both modalities, even though the features are dissimilar.
To support the idea that radar can enhance the accuracy of long-range object detection, Table 4 plots the improvement brought by radar fusion. The accuracy for cars gradually decreases as distance increases on account of image resolution. Our fusion method provides a performance boost in distant regions, which radar points are able to reach but which the camera can hardly see. Even for objects at a distance of 40m, the AP for Car gains 20%.

Table 1: State-of-the-art comparison on the nuScenes test set. 'L', 'C', and 'R' denote LiDAR, camera, and radar, respectively. † represents test-time augmentation. Notably, BEVDet-Tiny is our BEV camera-only baseline, and CenterNet is that of CenterFusion and CRAFT. δ represents a SECOND [29] network as decoder, compared to the base version.

Robustness against Weather Conditions
We design two experiments to demonstrate the robustness of the proposed fusion model. Since only three submissions are allowed for evaluation on the nuScenes test set, the special experimental dataset is selected from the validation set based on the description of each scene. There are 5417, 602, and 1088 samples for day, night, and rain, respectively. All of the aforementioned parameters are kept the same as before, except for the dataset type. Table 5 shows the robustness to weather and sight conditions, with the performance at night (+10%) and in rain (+12%) providing an intuitive comparison.
The camera-only model is severely affected by sight conditions, losing 20%. Sight conditions do not mean that prediction errors occur only in dark environments; they also occur in illuminated scenes due to the reflection of headlights, as shown in the left panel of Figure 3. In rainy scenes, the HD waterproof camera is almost unaffected by the rain, except when raindrops fall directly on the camera lens, as shown in Figure 3 (right). The examples above have one thing in common: the radar can still work well, so accurate distance prediction boosts the fusion. Compared to the camera-only model, our fusion model brings better performance mainly when the camera works normally, which means that in camera-radar fusion, the camera determines the lower bound of the fusion model's accuracy. As a non-visual sensor, sparse radar alone makes the task of 3D object detection challenging. Consequently, detections missed by the camera but recovered through radar fusion obtain unsatisfactory confidence scores, due to the nature of radar, which cannot work alone but serves as a supplementary sensor.

Qualitative Results
The visualized comparison between the camera-only and camera-radar fusion models is shown in Figure 4. By contrast, the fusion method precisely refines the image proposals and recovers objects that are not correctly recognized by the camera-only model. Apart from the intrinsic occlusion problem, although BEV-Radar performs well in the BEV, height and vertical-axis centers are difficult for BEV methods, which show misalignment when projected to the perspective view. Most objects are detected by the camera-only model; however, there are errors in either the center or the size. In particular, the radar-fusion model is better at orientation prediction due to the accurate velocity auxiliary. As discussed in Sec. 3.2.3, the radar-fusion method can remedy some errors made by the camera, but not all. Although some of the ignored objects are successfully detected by the radar, the fusion method yields lower confidence scores without camera judgments.

Conclusions
In this paper, we propose an end-to-end robust camera-radar 3D detection framework with a bidirectional transformer-based fusion strategy that adaptively embeds radar features while preserving their spatial relationships. Unlike other radar fusion methods, which typically require prior 3D object detection from camera baselines, our approach does not rely entirely on visual results; instead, complementary BEV features enhance the visual sensors, making the approach portable to other multi-sensor frameworks. Our work sets high-performance results on nuScenes detection, and extensive experiments demonstrate the robustness of radar fusion. We also discuss the effectiveness of sensor fusion under the different weather and sight conditions that realistic systems face, and hope that BEV-Radar will inspire practical applications.