RCRFNet: Enhancing Object Detection with Self-Supervised Radar–Camera Fusion and Open-Set Recognition

Robust object detection in complex environments, poor visual conditions, and open scenarios presents significant technical challenges in autonomous driving. These challenges necessitate the development of advanced fusion methods for millimeter-wave (mmWave) radar point cloud data and visual images. To address these issues, this paper proposes a radar–camera robust fusion network (RCRFNet), which leverages self-supervised learning and open-set recognition to effectively utilise the complementary information from both sensors. Specifically, the network uses matched radar–camera data through a frustum association approach to generate self-supervised signals, enhancing network training. The integration of global and local depth consistencies between radar point clouds and visual images, along with image features, helps construct object class confidence levels for detecting unknown targets. Additionally, these techniques are combined with a multi-layer feature extraction backbone and a multimodal feature detection head to achieve robust object detection. Experiments on the nuScenes public dataset demonstrate that RCRFNet outperforms state-of-the-art (SOTA) methods, particularly in conditions of low visual visibility and when detecting unknown class objects.


Introduction
Autonomous driving technology has advanced rapidly in recent years, focusing mainly on three critical components: environment perception, decision-making, and control [1]. Environment perception is the foundation of autonomous driving, performing functions such as 2D and 3D object detection, depth estimation, and prediction, much as human eyes do. This perception relies on real-time data gathered by onboard sensors. However, the data obtained from a single sensor are often insufficient for the complex tasks required for environment perception.
Common onboard sensors include vision cameras, LiDAR, and millimeter-wave (mmWave) radar. Vision cameras provide rich geometric information but are significantly affected by lighting conditions, and their reliable detection distance is limited to about 50 m. LiDAR sensors provide high-resolution 3D mapping and precise distance measurements, making them excellent for detailed environmental modelling, but they are costly and less effective in bad weather [2]. In contrast, mmWave radar emits active signals to measure the reflective properties of objects. Although it offers less geometric detail, it provides a reliable detection distance of up to 250 m and strong penetrability, making it well suited to harsh environments. Consequently, mmWave radar has become an indispensable sensor for current automated driving applications.
Adverse weather conditions, including heavy rain and fog, present significant challenges to visual detection technology, degrading the object detection performance of models such as SSD [3], DETR [4], and Faster R-CNN [5]. Additionally, the diversity of objects in real-world scenes often exceeds the categories covered by a detection model's training data, resulting in a model mismatch problem [6]. Recently, there has been increasing interest in open-set detection and recognition methodologies [7,8]. In [9], a deep learning approach is proposed for open-set recognition that fits an extreme value distribution to each category and uses the resulting class probabilities to identify open-set images. In the realm of video action recognition, ref. [10] introduces deep evidential learning grounded in evidence theory, effectively addressing classification uncertainty in video actions. Moreover, ref. [11] introduces a reconstruction learning algorithm aimed at enhancing the robustness of detecting unknown classes.
Radar can reliably detect a wide range of object categories, including those absent from the training set, making it useful for detecting obstacles outside the training categories. As a result, the fusion of radar and camera technologies has gained considerable attention in recent years, aiming to leverage the strengths of both sensors. Millimeter-wave radar point clouds offer precise measurements of object distance and velocity [2]. However, owing to their low angular resolution and sparsity, they cannot independently perform complex object detection tasks or accurately identify an object's category and geometry. Traditional mmWave radar point cloud processing methods, such as distance-FFT, Doppler-FFT, incoherent accumulation, and angle estimation, fall short of comprehensive object detection. Recent advances include Radar-pointGNN [12], which uses graph neural networks for object detection, and methods that apply graph neural networks to raw radar tensor data for 3D object detection [13]. However, these algorithms rely heavily on large amounts of training data and prior knowledge to extract features, which complicates fusion with image data and results in reduced robustness and increased complexity. Consequently, machine learning methods for radar-vision fusion detection have attracted growing attention from researchers. These approaches aim to integrate the strengths of both radar and vision systems to enhance detection capabilities and overcome the limitations of individual sensor modalities.
Currently, there are three primary approaches to multimodal fusion [14]: early fusion (data-level) methods [15], intermediate fusion (feature-level) methods [16][17][18][19], and late fusion (decision-level) methods [20,21].

In data-level fusion, a region of interest (ROI) is first generated from the radar points [22]. The corresponding region of the visual image is then extracted using this ROI, and object detection is performed on these image regions with a feature extractor and classifier. For instance, in [23], a fusion of mmWave radar and camera vision is proposed for pedestrian tracking, whereby the size of the initial ROI is determined by the distance between the obstacle and the mmWave radar. In [24], Kadow et al. applied the Haar-like model, a classic feature extraction algorithm for face detection. Data-level fusion operates at the most basic level of information, making it vulnerable to uncertainty, incompleteness, and instability in sensor data [25].

Feature-level fusion methods, offering greater flexibility, typically involve layer-by-layer feature extraction and fusion. For instance, in [26], mmWave radar is integrated into visual detection models using frustum correlation to associate the heterogeneous data sources. In [27], the authors proposed a transformer-based approach for LiDAR-camera fusion, enhancing robustness against degraded images and sensor misalignment through soft-association strategies. In [18], to tackle issues such as the low angular resolution of mmWave radar, which complicates distinguishing radial objects and produces false-positive ghost points due to multi-path interference, radar data are augmented with image data containing semantic features; the augmented radar data are then fused by transforming the prediction boxes of the image modality into polar coordinates. The paper [28] proposes a Spatial Attention Fusion (SAF) module for sensor feature fusion, demonstrating that fusing mmWave radar point clouds significantly enhances detection robustness and improves performance across weather conditions.

In decision-level fusion, each sensor processes its data and makes decisions independently, after which the results are transmitted to a fusion centre for final decision-making; this involves two steps: sensing information processing [29] and decision fusion [30]. For instance, some studies have used lists of radar detection targets to validate the results of vision-based detection [31]. Additionally, reference [30] introduced a motion stereo algorithm to further adjust and refine the final detection outcomes. Mainstream traditional methods for decision-level fusion include D-S evidence theory, Bayesian reasoning, and fuzzy inference theory, among others.

Although most fusion detection models achieve better accuracy than single-sensor models, they still struggle in complex environments, poor visual conditions, and open scenarios.
To address the aforementioned challenges, this paper presents a novel radar-camera fusion detection method. The key contributions are as follows:
(1) Development of a new fusion detection framework with parallel inputs: This framework enhances feature extraction from visual images and radar point clouds through multiple modules. It aims to generate effective data features for robust detection, even in adverse weather conditions such as heavy rain and fog.
(2) Introduction of a novel data correlation and self-supervised learning technique: This method effectively integrates radar point clouds and visual images, leveraging their complementary strengths to enhance the model's robustness.It incorporates an attention mechanism and self-supervised learning for efficient fusion.
(3) Implementation of an unknown category recognition method with confidence assessment: This approach utilizes depth estimation from radar point clouds and visual images to improve the detection and recognition of open-set categories.By leveraging depth information, the model enhances its capability to identify unknown categories while also improving the classification of known categories.
The rest of this paper is organised as follows. Section 2 presents the fusion detection framework, emphasising robust feature extraction, radar-camera correlation, self-supervised learning, and open-set recognition. Section 3 conducts experimental validation and comparative analyses. Section 4 provides the ablation study of each proposed module. Finally, Section 5 concludes the paper.

Overall Network Structure
To enhance object detection performance in low-visibility and open conditions, we propose a novel deep learning-based radar-camera robust feature fusion network (RCRFNet). This network effectively leverages the complementary information from millimeter-wave radar data and visual camera images.
The overall architecture of RCRFNet, depicted in Figure 1, consists of two parallel sub-networks for processing visual images and radar point cloud data, respectively. In addition, it comprises the following modules: the feature extraction backbone, the feature association and self-supervised learning module, the radar-camera open-set recognition module, and the multimodal feature detection head.
The feature extraction backbone: It extracts multilevel features from both radar point cloud data and visual images by projecting the radar data onto pixel coordinates to obtain multidimensional data. Subsequently, a multi-layer fusion structure, consisting of upsampling and downsampling layers, is utilised to extract high-dimensional features and align data at various resolutions.
The feature association and self-supervised learning module: It selects high-fidelity radar-camera data pairs, by means of a frustum association process applied to the heatmaps of both radar data and visual images, to generate self-supervised signals that boost the network's performance. Furthermore, spatial attention and channel attention mechanisms are introduced to enhance the heatmap association.
The radar-camera open-set recognition module: It is designed to improve object classification accuracy, especially for unknown categories.
The multimodal feature detection head: Finally, the fused multilevel features are linearly transformed to the same size for concatenation, generating the final detection results through the constructed multimodal feature detector.

Feature Extraction Backbone and Heatmap Generation
We introduce a multi-layer fusion structure to merge different layers of radar and visual features, aiming to enhance the detector's capability to detect targets across various scales. It is based on the deep layer aggregation (DLA) backbone from [32] and incorporates deformable convolution layers to capture the geometric deformations of targets. The designed Deformable DLA (DDLA) module is depicted in Figure 2. The DLA34 network is a deep residual network [33] that uses a multi-layer hierarchical fusion structure. Its two-dimensional structure is built from hierarchical deep aggregation modules, which preserve and combine feature channels by joining blocks from different stages into a tree. As a result, the network can integrate both deep and shallow feature information, producing more feature hierarchies and richer combinations. The adopted DLA34 in Figure 2 first passes through a base layer and then through four tree hierarchies, from level 1 to level 4. However, the existing DLA34 relies on standard n × n convolution kernels for feature extraction, which may not effectively capture the geometric deformations of objects [34]. To address this deficiency, we incorporate deformable convolution layers into the DLA34 backbone, introducing additional offset parameters and expanding the learning range.
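As a concrete sketch, such a layer can be assembled from torchvision's DeformConv2d together with a small convolution that predicts the sampling offsets ∆p_n discussed below; the module name DeformableBlock and the channel sizes are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """One deformable convolution layer of the kind inserted into DLA34.

    A regular convolution predicts a (dx, dy) offset for every kernel
    sample point at every spatial position; DeformConv2d then samples
    the input at these shifted locations.
    """
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # 2 * k * k offset channels: one (dx, dy) pair per kernel sample.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)        # learned offsets (Delta p_n)
        return self.deform_conv(x, offsets)  # sample at p_0 + p_n + Delta p_n

feat = torch.randn(1, 64, 56, 56)
print(DeformableBlock(64, 128)(feat).shape)  # torch.Size([1, 128, 56, 56])
```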
Deformable convolution layers integrate extra offsets, enhancing the extraction of features for irregular objects. Figure 3 contrasts deformable convolution with traditional convolution, with arrows indicating the additional offsets introduced. This shows that deformable convolution can represent object features more efficiently. The deformable convolution can be expressed as follows:

$$\alpha(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, \alpha(p_0 + p_n + \Delta p_n), \tag{1}$$

where $w(p_n)$ is the convolution layer weight, $\alpha$ is the feature map, $p_0$ is any point in the feature map, $p_n$ is the position vector of each point in the convolution kernel with respect to the centre ($\mathcal{R}$ denotes the set of these positions), and $\Delta p_n$ denotes the positional offset between the feature map and the convolution kernel. A new offset is introduced at each point, generated from the input feature map by another convolution kernel.

The image features extracted by the DDLA module are then fed into a tentative detection head to generate visual heatmaps for subsequent processing. Suppose the tentative detector obtains a series of target candidates belonging to $C$ categories; then, the target candidates of the same category are used to generate a heatmap. Thus, we have

$$H_i = m(P_i), \quad i = 1, \dots, C, \tag{2}$$

where $P_i$ denotes the set of target candidate centres corresponding to category $i$, and $m(P_i)$ is a Gaussian kernel function applied to the set $P_i$. Let $P_i = \{p_{i,1}, \dots, p_{i,N_i}\}$, where $N_i$ is the number of target candidates belonging to category $i$. Then, $m(P_i)$ can be expressed as follows:

$$m(P_i)(x) = \max_{1 \le j \le N_i} \exp\!\left(-\frac{D(x, p_{i,j})^2}{2\sigma_{i,j}^2}\right), \tag{3}$$

where $x$ is the coordinate of any point on the $i$-th heatmap, $D(x, p_{i,j})$ is the Euclidean distance between the two points $x$ and $p_{i,j}$, and $\sigma_{i,j}$ is the adaptive variance set by the target radius of $p_{i,j}$.
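A minimal sketch of this per-class heatmap rendering, assuming the element-wise maximum combination of Gaussians that CenterNet-style detectors use; the function name and array conventions are illustrative.

```python
import numpy as np

def class_heatmap(centers, sigmas, h, w):
    """Render the heatmap for one category from candidate centres p_{i,j}.

    centers: (N, 2) array of (x, y) candidate centres for category i
    sigmas:  (N,) adaptive standard deviations from each target's radius
    """
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for (cx, cy), s in zip(centers, sigmas):
        d2 = (xs - cx) ** 2 + (ys - cy) ** 2          # D(x, p_{i,j})^2
        heat = np.maximum(heat, np.exp(-d2 / (2.0 * s ** 2)))
    return heat
```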
Next, we introduce radar point cloud data filtering and heatmap generation. To address the sparsity and low resolution of radar point cloud data, we apply data filtering to enhance data density and suppress noise. Considering the targets' manoeuvring motion, we use the extended Kalman filter (EKF) algorithm to preprocess the radar point cloud data.

Let $(P_x, P_y)$ and $(V_x, V_y)$ denote the position and velocity of a target measured by the mmWave radar. The measurement model, comprising range, azimuth angle, and radial velocity, can be characterised by

$$h(P_x, P_y, V_x, V_y) = \begin{bmatrix} \sqrt{P_x^2 + P_y^2} \\ \arctan(P_y / P_x) \\ \dfrac{P_x V_x + P_y V_y}{\sqrt{P_x^2 + P_y^2}} \end{bmatrix}. \tag{4}$$

The Jacobian matrix for the measurement model is [35]

$$H = \begin{bmatrix} \dfrac{P_x}{\sqrt{P_x^2 + P_y^2}} & \dfrac{P_y}{\sqrt{P_x^2 + P_y^2}} & 0 & 0 \\ -\dfrac{P_y}{P_x^2 + P_y^2} & \dfrac{P_x}{P_x^2 + P_y^2} & 0 & 0 \\ \dfrac{P_y (V_x P_y - V_y P_x)}{(P_x^2 + P_y^2)^{3/2}} & \dfrac{P_x (V_y P_x - V_x P_y)}{(P_x^2 + P_y^2)^{3/2}} & \dfrac{P_x}{\sqrt{P_x^2 + P_y^2}} & \dfrac{P_y}{\sqrt{P_x^2 + P_y^2}} \end{bmatrix}. \tag{5}$$

In practice, we take the weighted sum of the current frame data and the Kalman filter output as the preprocessing result. Furthermore, the radar heatmap is generated using Gaussian kernels centred on the preprocessed point cloud, with a fixed value as the radius. Each point has three channels corresponding to range, azimuth angle, and radial velocity, respectively. Additionally, each point is extended into a pillar of preset height to compensate for the association deficiency caused by missing height information.
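To make the preprocessing concrete, the following is a direct transcription of the measurement model and its Jacobian for the state (P_x, P_y, V_x, V_y); it is a sketch of the EKF update ingredients, not the paper's implementation.

```python
import numpy as np

def radar_measurement(state):
    """h(x): map the state (Px, Py, Vx, Vy) to a radar measurement
    (range, azimuth, radial velocity)."""
    px, py, vx, vy = state
    rng = np.hypot(px, py)
    # arctan2 is used instead of arctan(Py/Px) for quadrant safety.
    return np.array([rng, np.arctan2(py, px), (px * vx + py * vy) / rng])

def measurement_jacobian(state):
    """Jacobian H = dh/dx evaluated at the current state estimate."""
    px, py, vx, vy = state
    r2 = px ** 2 + py ** 2
    r = np.sqrt(r2)
    return np.array([
        [px / r,   py / r,  0.0,    0.0],
        [-py / r2, px / r2, 0.0,    0.0],
        [py * (vx * py - vy * px) / r ** 3,
         px * (vy * px - vx * py) / r ** 3,  px / r,  py / r],
    ])
```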

Radar-Camera Association and Self-Supervised Learning
The visual camera and mmWave radar sensors offer complementary information about the target. Visual images provide high resolution and can capture detailed features such as geometry, contours, and texture, but they are sensitive to variations in lighting conditions. Conversely, despite lower resolution and angular accuracy, radar point cloud data provide physical attributes of the target, such as speed, angle, and distance, regardless of lighting.
To leverage the complementary advantages of both sensors, this paper adopts a process in which high-resolution visual image features serve as a reference, and point cloud data with high spatial location matching are then selected for correlation and fusion. In essence, correlated data obtained under favourable lighting conditions are chosen to generate self-supervised signals. Our method aims to enhance the network's feature learning and representation capabilities by leveraging the inherent strengths of both sensor modalities.
Specifically, the frustum association approach is used to correlate the radar point cloud with visual images, generating the self-supervised signal for training, as depicted in Figure 4. The target's 3D bounding box is transformed into a frustum coordinate system, with its centre and orientation angle denoted as $(O_X, O_Y, O_Z, \phi)$. The orientation of the target in the bird's eye view is also indicated by an arrow in Figure 4. The association between a radar point $(P_X, P_Y, P_Z)$ and the target's bounding box is achieved by calculating the Euclidean distance

$$d = \sqrt{(P_X - O_X)^2 + (P_Y - O_Y)^2 + (P_Z - O_Z)^2}. \tag{6}$$

The point with the lowest Euclidean distance is selected, and its heatmap is concatenated with the image heatmap to generate the final detection results. An example of a frustum association result is shown in Figure 5, where the radar point cloud, rendered in pillar format, is well matched with the target in the image.
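A minimal sketch of this nearest-point selection, assuming the radar points have already been restricted to the target's frustum; the function name is illustrative.

```python
import numpy as np

def associate_point(frustum_points, box_center):
    """Pick the radar point closest to the 3D box centre (O_X, O_Y, O_Z).

    frustum_points: (N, 3) radar points (P_X, P_Y, P_Z) inside the frustum
    box_center:     (3,) detected bounding-box centre
    """
    d = np.linalg.norm(frustum_points - box_center, axis=1)  # Euclidean distances
    idx = int(np.argmin(d))
    return frustum_points[idx], float(d[idx])
```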
It is important to emphasise that the frustum association approach depends heavily on visual detection results; when visual information is unreliable, it struggles to accurately correlate the millimeter-wave radar point clouds.

Next, the mean squared error (MSE) loss between the radar heatmap and the frustum-associated heatmap is used for self-supervised learning to accelerate the convergence of the model:

$$L_{MSE} = \frac{1}{N} \sum_{x} \left( H_o(x) - H_a(x) \right)^2, \tag{7}$$

where $H_o$ and $H_a$ are the radar heatmaps generated by all point clouds and by the associated point cloud, respectively, and $N$ is the number of heatmap elements. In the prediction phase, the associated radar heatmap is directly concatenated with the image heatmap and fed into the detection head to generate the final results.

Additionally, to enhance the quality of the associated radar heatmap, spatial attention and channel attention mechanisms are introduced, as shown in Figure 6. First, features are compressed in the spatial dimension using global average pooling to gain a larger receptive field. Next, normalised weights are obtained through a fully connected layer, an activation function, and a normalisation operation. Finally, these weights are used to emphasise the importance of each feature channel, yielding the attention-enhanced associated point cloud heatmap. Moreover, the radar point cloud features are used to filter noise and outliers, thereby reducing false positives during detection.
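The channel attention branch described above follows the familiar squeeze-and-excitation pattern; the sketch below is one plausible realisation, where the reduction ratio is an assumption rather than a value from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention for the associated radar heatmap:
    global average pooling -> fully connected layers -> sigmoid weights."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: global average pooling
        w = self.fc(w).view(b, c, 1, 1)   # excitation: normalised channel weights
        return x * w                      # re-weight each feature channel
```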

Radar-Camera Open-Set Recognition Module
For the visual heatmap generation in Equation (2), it is assumed that the targets to be detected belong to $C$ predefined categories. However, targets in real-world scenarios are diverse, and environments are complex, so it is challenging to enumerate all possible target categories in advance. When a target does not fall within one of the $C$ predefined categories, detection performance may decrease significantly. To address this issue and improve the approach's performance in real-world applications, recently developed open-set techniques are introduced, and a radar-camera open-set recognition (RCOSR) module is designed.
The RCOSR leverages the complementary strengths of both sensors to improve the recognition accuracy of unknown-class targets. Visual images, with their high resolution and detailed target information, enable highly confident target category predictions based on visual features. However, they are vulnerable to weather and lighting conditions, and the depth information they provide is often ambiguous. In contrast, radar provides more accurate depth information than visual images [36]. Therefore, we propose a novel depth-information-constrained feature aggregation method for open-set recognition, as depicted in Figure 7. Specifically, we calculate the global and local depth consistencies between the radar point cloud and the visual images, which are then multiplied with the visual image features to construct confidence levels for known target classes, thus achieving unknown target detection. The global depth consistency is quantified using the Kullback-Leibler (KL) divergence between the radar depth image $d_R(x, y)$ and the camera depth image $d_C(x, y)$:

$$D_{KL}(d_R \,\|\, d_C) = \sum_{x, y} d_R(x, y) \log \frac{d_R(x, y)}{d_C(x, y)}. \tag{8}$$

The local depth consistency $D(x, y)$ is computed analogously at each coordinate $(x, y)$ over a local neighbourhood. Additionally, considering that the occurrence probabilities of the target categories in real-world scenarios may vary significantly and affect the discrimination of unknown classes, a weight $\sigma_i$ measuring the occurrence probability of category $i$ is introduced. Finally, the weighted confidence level $S(x, y)$ of known-class targets is determined from the depth consistencies, the visual features, and $\sigma_i$. A target candidate with $S(x, y) \le S_t$ is classified as an unknown-class target, where $S_t$ is a threshold set according to the confidence probability.
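Since the exact combination of the consistency terms, the class weight σ_i, and the visual confidence is not reproduced above, the sketch below shows one plausible scoring function: the depth patches are normalised to distributions, the KL divergence is mapped into (0, 1] via exp(-KL), and the factors are combined multiplicatively. All of these choices are assumptions made for illustration.

```python
import numpy as np

def openset_score(d_radar, d_cam, cls_prob, sigma_i, eps=1e-8):
    """Depth-constrained confidence for one candidate region.

    d_radar, d_cam: depth patches from radar and camera
    cls_prob:       visual classifier probability for the predicted class
    sigma_i:        occurrence-probability weight of the class
    """
    p = d_radar / (d_radar.sum() + eps)
    q = d_cam / (d_cam.sum() + eps)
    kl = np.sum(p * np.log((p + eps) / (q + eps)))  # global depth consistency
    consistency = np.exp(-kl)                       # map divergence into (0, 1]
    return sigma_i * consistency * cls_prob

# Candidates with openset_score(...) <= S_t are flagged as unknown-class targets.
```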

Multimodal Feature Detection Head
The multimodal feature detection head comprises two stages. The first stage, consisting of a convolution layer and an activation function, generates target proposals for radar-camera association. The second stage generates the final fusion features through point-to-point splicing matching between the fusion heatmap and the camera heatmap. The structure of the detection head is illustrated in Figure 8. This two-stage design ensures more reliable detection. The loss function of the proposed model has two parts, the focal loss and the binary cross-entropy (BCE) loss, and can be expressed as

$$L = L_{pre,cls} + L_{pre,reg} + L_{fusion,cls} + L_{fusion,reg}. \tag{9}$$

The focal losses $L_{pre,cls}$ and $L_{fusion,cls}$ are associated with category information, while the BCE losses $L_{pre,reg}$ and $L_{fusion,reg}$ pertain to size information. This combination allows the model to effectively learn and optimise for both category and size information.
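A sketch of one stage's loss under the stated split (focal loss for the category heatmaps, BCE for the size maps), using torchvision's sigmoid_focal_loss; the relative weighting lam is an assumption, and the full model sums this term over both stages.

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def stage_loss(cls_logits, cls_targets, reg_logits, reg_targets, lam=1.0):
    """L_cls + lam * L_reg for one stage of the detection head."""
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    l_reg = F.binary_cross_entropy_with_logits(reg_logits, reg_targets)
    return l_cls + lam * l_reg
```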

Experiments and Results
The experiments are divided into three sections: Section 3.1 presents the overall detection performance of the proposed approach, Section 3.2 focuses on robustness analysis, and Section 3.3 primarily verifies the open-set performance.
For comparison, three state-of-the-art (SOTA) detection algorithms for visual images (Mono3D [37], CenterNet [38], and FCOS3D [39]), as well as the latest fusion detection method, CenterFusion [18], are selected as benchmarks. For the open-set recognition module, the traditional OpenMax method is chosen for comparison, because current mainstream radar-camera fusion detection methods rarely address the open-set issue.
We used the nuScenes dataset [40] for the experiments. This dataset comprises data from mmWave radar, cameras, and other sensors, as detailed in Table 1. We selected data from over 1000 scenes covering various environments and lighting conditions; each scene contains 40 keyframes and 20 s of data. The experiments were conducted on an Ubuntu 18.04 system equipped with four GeForce RTX 4080 GPUs (NVIDIA, Santa Clara, CA, USA), each with 24 GB of memory. The Adam optimizer was used to iteratively update the network weights, with a weight decay rate of 0.0001. Following a step learning strategy, the learning rate was halved every 20 epochs, and a total of 50 epochs of training were performed. The training parameter settings are summarised in Table 2. The preprocessing provided by the nuScenes official website was applied to the dataset during the experiments.

To evaluate detection accuracy, we use the average precision (AP) metric, where a match is determined by thresholding the two-dimensional centre distance on the ground plane. The mean average precision (mAP) is calculated as follows:

$$mAP = \frac{1}{|C||M|} \sum_{c \in C} \sum_{m \in M} AP_{c,m}, \tag{10}$$

where $C$ denotes the set of predicted classes and $M$ represents the set of distance thresholds between the predicted bounding box and the ground truth. In this study, we set the threshold values to $M = \{0.5, 1, 2, 4\}$ m. Additionally, the experiments calculate five true-positive (TP) metrics: average translation error (ATE), average scale error (ASE), average orientation error (AOE), average velocity error (AVE), and average attribute error (AAE). For each of these metrics, the mean true positive error (mTP) over all classes is derived as follows:

$$mTP = \frac{1}{|C|} \sum_{c \in C} TP_c. \tag{11}$$

Following this, the nuScenes detection score (NDS) is computed as

$$NDS = \frac{1}{10} \left[ 5 \cdot mAP + \sum_{mTP \in \mathbb{TP}} \left( 1 - \min(1, mTP) \right) \right]. \tag{12}$$

This metric effectively encapsulates various aspects of the detection task, including velocity and attribute estimation.
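Equation (12) is straightforward to reproduce; the small self-contained helper below computes the NDS from the mAP and the five mTP error metrics.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mTP error metrics.

    tp_errors: dict such as
    {"mATE": 0.6, "mASE": 0.3, "mAOE": 0.5, "mAVE": 0.8, "mAAE": 0.2}
    Each error is clipped to [0, 1] before being turned into a score.
    """
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors.values()]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

print(nds(0.33, {"mATE": 0.6, "mASE": 0.3, "mAOE": 0.5,
                 "mAVE": 0.8, "mAAE": 0.2}))  # 0.425
```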

Overall Detection Performance
The comparison results of our approach with the SOTA algorithms are presented in Table 3, where all the aforementioned metrics are evaluated. Among these metrics, NDS and mAP are the two most representative, with higher values indicating better performance. As shown in Table 3, our approach achieves the highest NDS and mAP scores. Specifically, the NDS improves to 0.46, a 2.5% relative improvement over CenterFusion. Additionally, the AOE and AVE show significant relative improvements of 12.6% and 9.8%, respectively. Our approach incorporates radar information through the RCSL module, leading to a substantial enhancement in the accuracy of target speed estimation, as reflected in the AVE metric. In summary, the proposed method effectively fuses mmWave radar point cloud data with visual images, resulting in improved overall detection performance.

Figure 9 illustrates several visualisation results of our proposed approach. We selected three representative frames from the test data, covering scenarios with pedestrian interference, target occlusion, and multi-scale small objects. The target centre, ground truth, and detection results for these frames are displayed from left to right. Our method demonstrates notable proficiency in detecting and identifying distant objects (marked in green), occluded objects (marked in yellow), and obstacles (marked in orange). For autonomous driving, target detection results are occasionally converted into a bird's eye view (BEV) to aid downstream planning and control tasks. Figure 10 shows several BEV plots of detection results in typical scenarios. These plots demonstrate that RCRFNet delivers excellent detection results by fusing radar and camera data.

Robustness Performance Analysis
To validate the robustness of the proposed method, we conducted a comparative analysis under various visibility conditions. In this section, experiments were performed with images at four different visibility levels, while keeping the other parameter settings unchanged; this setup simulates different weather conditions. Table 4 presents the experimental results for CenterFusion and our RCRFNet. In this table, the maximum loss value, calculated as the difference between the performance index on the highest-visibility images and that on the lowest-visibility images, is used as the robustness criterion. RCRFNet exhibits a lower maximum loss value for NDS, mAOE, mAVE, and mAP, indicating higher robustness under varying weather conditions. Table 5 details the performance of CenterFusion and RCRFNet across four visibility levels: 0-40%, 40-60%, 60-80%, and 80-100%. In these fine-grained comparative experiments, the training data volume is significantly reduced by data screening, posing additional challenges for the detection models. Despite this, our method outperforms CenterFusion, thanks to the new processing modules that effectively integrate the complementary strengths of radar point cloud data and camera image data. Figure 11 displays several detection results of RCRFNet under harsh environmental conditions. From top to bottom, the rows show the original image, the ground truth, the detection results, and the image with the objects' depth information.

Open-Set Performance Analysis
In this subsection, we verify the effectiveness of the proposed RCOSR module. We divide the dataset categories into known and unknown categories; only the known categories are used during training, and the trained model is then applied in open-set recognition experiments. A comparison with the OpenMax method under two different visibility levels is shown in Table 6. To ensure a fair comparison, the number of training iterations is set to 30, while the other parameters remain unchanged. Table 7 presents detailed performance results for the closed-set categories of OpenMax and RCRFNet. The experimental results on the nuScenes dataset demonstrate that the proposed radar-vision fusion open-set recognition method achieves better detection accuracy. These findings confirm, to a certain extent, the effectiveness of the RCOSR module.

Ablation Study
This section describes the ablation experiments on the nuScenes dataset, using CenterFusion as our baseline to validate the effectiveness of each module: the deformable convolution layers, RCSL, and RCOSR.
In the experiments, we incrementally added the deformable convolution layers, RCSL, and RCOSR modules to our model. The results are summarised in Table 8. The findings indicate that the deformable convolution layers improve NDS compared to the baseline. The RCSL module, which primarily enhances robustness, improves the mAAE and mAOE metrics. The RCOSR module, by computing a contrastive loss between the radar depth information and the depth predicted from images, corrects the image-derived depth prediction, significantly enhancing both mAP and NDS. Notably, the NDS index shows an improvement of 8.9%, demonstrating the effectiveness of incorporating radar depth information to enhance detection performance.
Furthermore, comparisons of model parameter size and the NDS metric between our approach and the SOTA methods are listed in Table 9. The proposed RCRFNet incurs only a 4.1% increase in model size compared to CenterFusion while achieving a 2.5% improvement in the NDS metric. To demonstrate the superior performance of our approach relative to CenterFusion under various scene conditions, Figure 12 presents detection results for scenarios including occluded obstacles, small objects in low-light conditions, long-distance observations, and lens-contaminated scenes. It is evident that our approach is more effective in detecting distant objects and handling scenes with occlusions and lens contamination.

Figure 1. The overall architecture of the proposed RCRF network.

Figure 3. Contrast between (a) traditional convolution and (b) deformable convolution. The arrows in (b) indicate the additional parameter offsets introduced by deformable convolution.

Figure 4. The frustum association approach in the bird's eye view.

Figure 5. An example of the frustum association result.

Figure 6. The attention module for enhancing the quality of the association heatmap.

Figure 9. Results of our approach under harsh environmental conditions.

Figure 11. Results of our approach under harsh environmental conditions.

Figure 12. Qualitative comparison between our approach and CenterFusion under various scene conditions. For each row, the images from left to right are the raw image, the detection result of CenterFusion, and the result of our approach.

Table 3. Comparison results with SOTA algorithms on the nuScenes dataset. ↑ indicates that higher is better and ↓ indicates that lower is better.

Table 4. Comparison of the maximum loss value under varying image visibility levels.

Table 5. Performance at four different image visibility levels. ↑ indicates that higher is better and ↓ indicates that lower is better.

Table 7. Comparison details for closed-set categories.

Table 8. Ablation experiments for each module. ↑ indicates that higher is better and ↓ indicates that lower is better.

Table 9. Comparisons of model size and NDS metric.