D-S Augmentation: Density-Semantics Augmentation for 3-D Object Detection

Cameras and light detection and ranging (LiDAR) sensors are commonly used together in autonomous vehicles: cameras provide rich color and texture information, whereas LiDAR provides accurate information on object locations. However, fusing image and point-cloud information remains a key challenge. In this article, we propose D-S augmentation, a 3-D object detection method based on point-cloud density and semantic augmentation. Our approach first performs 2-D bounding box detection and instance segmentation on an image. The LiDAR point cloud is then projected onto the instance segmentation masks, and a fixed number of random points are generated. A global $N$-nearest neighbor clustering is used to associate the random and projected points, assigning depths to the virtual points and completing point-cloud density augmentation (P-DA). Next, point-cloud semantic augmentation (P-SA) is performed, in which the instance segmentation mask of an object is associated with the point cloud: the instance class labels and segmentation scores are assigned to the projected points, and the projected points, extended with these 1-D features, are inversely mapped back to the point-cloud space to obtain a semantically augmented point cloud. We conducted extensive experiments on the nuScenes (Caesar et al., 2020) and KITTI (Geiger et al., 2012) datasets. The results demonstrate the effectiveness and efficiency of the proposed method. Notably, D-S augmentation outperformed a LiDAR-only baseline detector by +7.9% in terms of mean average precision (mAP) and +5.1% in terms of nuScenes detection score (NDS) and outperformed state-of-the-art multimodal fusion-based methods. We also present ablation studies showing that the fusion modules improved the performance of the baseline detector.


I. INTRODUCTION
3-D object detection is a fundamental step in the perception systems used in autonomous driving to locate and categorize objects in 3-D space. Existing methods primarily use cameras and light detection and ranging (LiDAR) sensors. Cameras provide rich color and texture information but lack depth information. Conversely, LiDAR point clouds accurately represent the distances of objects, but their distribution is disordered, irregular, and sparse. Hence, 3-D object detection remains challenging.
Existing 3-D object detection methods can be divided into two categories: single-modality methods based on LiDAR alone and multimodal methods that fuse cameras and LiDAR sensors. Among these techniques, the most common LiDAR-based approach performs voxelization to extract features from point-cloud data. The pioneering method VoxelNet [1] utilized sparse voxel grids, and its authors proposed a voxel feature encoding layer to extract features from the points within each voxel. Several studies have considered similar voxel encoding strategies [5], [6], [7], [8], [9], [10], [11], [12]. SECOND [2] simplified VoxelNet [1] and accelerated sparse 3-D convolutions. These voxelization-based methods exhibit good real-time performance and have been widely deployed on mobile platforms. However, due to the sparseness of point clouds, methods based only on voxelization typically do not suffice for robust 3-D detection. For example, small or distant objects are voxelized with fewer features, which makes them difficult to detect using LiDAR alone. Fig. 1 shows the voxelization process performed by the SECOND method [2] at different stages. Evidently, the original point cloud included 79 points at a range of 53 m, whereas only 18 and 5 feature points remained after the two voxelization steps. This reduces the complexity of subsequent processing but has the disadvantage that too few points remain for distant objects. Conversely, these objects are clearly visible and distinguishable in high-resolution images. The complementary roles of point-cloud and image data have motivated the development of multimodal detection methods that combine the two modalities.
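To make this point-count reduction concrete, the following minimal NumPy sketch (our own illustration, not code from SECOND [2]) counts the occupied voxels that remain when a synthetic 79-point cluster at roughly 53 m is quantized at two assumed grid resolutions; the voxel sizes and the point distribution are placeholders chosen for illustration.

```python
import numpy as np

def count_occupied_voxels(points, voxel_size):
    """Quantize 3-D points into a voxel grid and count the occupied cells."""
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    return len(np.unique(voxel_idx, axis=0))

# Synthetic sparse cluster standing in for a distant object (illustrative only).
rng = np.random.default_rng(0)
obj_points = rng.normal(loc=[53.0, 0.0, 0.5], scale=[0.8, 0.4, 0.3], size=(79, 3))

print(count_occupied_voxels(obj_points, voxel_size=0.1))  # fine grid: many occupied voxels
print(count_occupied_voxels(obj_points, voxel_size=0.4))  # coarser grid: far fewer remain
```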
Existing multimodal 3-D object detection methods can generally be divided into three categories: 1) feature-level fusion [15], [16], [17]; 2) decision-level fusion [19], [20]; and 3) sequence fusion [33], [34]. Each approach has notable limitations. Feature-level fusion methods concatenate features from the two modalities, perform joint inference, and allow interactions between different feature layers. For example, MV3D [15] and AVOD [16] utilized a region proposal network (RPN) to process camera and LiDAR feature maps to generate 3-D regions of interest (RoIs) and bounding boxes. However, how to preprocess the point-cloud data remains a core problem in the design of fusion methods. MVX-Net [21] was therefore designed to project the voxel features of a point cloud onto an image feature map and then use an RPN to perform 3-D object detection on the projected and voxel features. Although this method reduces the information loss caused by changes in perspective, the voxel quantization process requires considerable computational resources. Decision-level fusion performs object detection with each modality separately and subsequently fuses the detection results [22]. This approach has the advantage that the different decision outputs can be processed in parallel; it is also relatively robust and is less affected by the failure of a single sensor. Recently, the camera-LiDAR object candidates fusion (CLOCs) method [20] used the geometric and semantic consistency of 2-D and 3-D detection boxes to achieve improved performance, but its results are still limited by the performance of the underlying 2-D and 3-D detectors. Sequence fusion is a newer approach that augments the original point cloud with the image detection results. PointPainting [33] sequentially fuses the image semantics of an object with the input point cloud to increase the dimensionality of the point cloud, effectively enriching the point-cloud features of the object. MVP [34] further adopted an image segmentation mask to generate virtual points that supplement the sparse point cloud and fed the densified point cloud into CenterPoint [7] to complete detection. These two newer fusion techniques have a flexible modular structure and can realize point-cloud augmentation at the input source. However, we consider that making full use of the image detection results and fusing the information most conducive to point-cloud detection is the key element of sequence fusion. PointPainting [33] does not consider the different weights that different positions within an instance mask should carry, and MVP [34] ignores the influence of nearest neighbors when estimating the depth of virtual points.
To address these issues, in this study, we propose an efficient multimodal fusion-based method called D-S augmentation to augment the density and semantic information of point-cloud data. The main contributions of this study are summarized as follows.
1) Point clouds cannot provide dense information about objects in a scene due to their inherent sparsity. We propose a simple yet effective point-cloud density augmentation method that uses global N-nearest neighbor association to correlate the depths of sampled points with those of real points, generating virtual points that are more consistent with the real environment.
2) To enrich the features of the original point cloud, we propose a semantic augmentation method in which the semantics of an image object are used to decorate the point cloud and a Gaussian mask is utilized to generate a point-cloud encoding.
3) We conducted extensive experiments on the large-scale nuScenes [14] and KITTI [36] datasets to verify the effectiveness of the proposed method. The results show that our approach achieved better 3-D object detection accuracy than previous state-of-the-art methods.
II. RELATED WORK

We first review the literature on 3-D object detection with LiDAR alone and then introduce multimodal fusion-based 3-D object detection methods.

A. LiDAR-Only 3-D Object Detection
Existing LiDAR-based 3-D object detection methods can be roughly divided into two categories: grid-based and point-based methods. Grid-based methods usually divide point clouds into regular 3-D voxels [1], [2], [7], [26] or bird's-eye-view (BEV) maps [3], [23]. In [1] and [2], the raw point cloud was converted into a compact voxel representation, voxel features were extracted using convolutional neural networks, and 3-D object detection was performed on these features. CenterPoint [7] replaced the conventional anchor-based design with an anchor-free approach. Among point-based methods [4], [24], [37], PointRCNN [4] and STD [24] used PointNet [25] and PointNet++ [13] to directly segment the original point cloud into foreground points and generated a proposal for each point. 3-DSSD [37] is a single-stage detector that removes the upsampling layers and refinement modules used in earlier point-based methods to achieve greater computational efficiency. Compared with point-based methods, grid-based methods are relatively computationally efficient and can be trained faster on large-scale datasets [14], [36].

B. Multimodal 3-D Object Detection
In recent years, methods based on multimodal fusion have been developed extensively. Several studies [27], [28], [29], [30] have utilized an image-based approach to generate region proposals and then processed the points within a region of interest (RoI) to detect objects in 3-D. However, these methods are limited by the quality of the 2-D object detection results. MV3D [15] was proposed as a two-stage detection method: one stage generates 3-D object proposals, and the other performs multiview feature fusion to realize feature interaction between different intermediate layers. AVOD [16] was proposed as an object detection network for aggregated views that generates reliable 3-D object proposals for objects of different categories by performing multimodal feature fusion on high-resolution feature maps. Some methods have also been developed to convert front-view camera features into BEV features [18], [31], [44]. ContFuse [31] proposed a novel end-to-end continuous fusion architecture that utilizes continuous convolutions to fuse image and LiDAR feature maps at different levels of resolution. Similarly, 3-D-CVF [44] adopted an auto-calibrated projection method designed to smoothly fuse LiDAR feature maps and image feature maps. Although these methods achieved good results, they still suffer from feature blurring. Other methods [21], [49], [33] have considered pointwise fusion mechanisms: MVX-Net [21] and PointPainting [33] used feature extraction and semantic segmentation networks to obtain image features and segmentation scores, respectively, and simply concatenated them with the point cloud. MMF [32] is an end-to-end learning network that jointly performs 2-D and 3-D object detection, as well as ground estimation and depth completion, and it has attracted considerable attention due to its well-designed fusion timing and strategy.

C. Point-Cloud Augmentation
Point-cloud data collected by LiDAR sensors are characterized by sparsity, and point-cloud augmentation methods aim to generate a denser point cloud from the sparse LiDAR data.
Representative methods can be divided into three categories. Image-based methods [50], [51] use depth completion to restore real-world objects from sparse measurements. Similarly, LiDAR-based methods [52], [53] focus on learning multilevel features for each point and expand the point set by implicit multibranch convolutional units to reconstruct multiple upsampled point clouds from each high-dimensional feature vector. Conversely, methods based on multimodal fusion [33], [34] use semantic information in image data to enhance point-cloud data and have achieved state-of-the-art detection performance.
Although various multimodal fusion methods have been proposed in recent years, in general, they do not easily outperform methods based on LiDAR alone. In this study, we propose an efficient point-cloud density and semantic augmentation method called D-S augmentation to overcome this challenge based on improved image representation and fusion methods.

III. D-S AUGMENTATION ARCHITECTURE
D-S augmentation focuses on multimodal sequence fusion to augment input data collected as LiDAR point clouds for 3-D detectors. We outline our framework in Fig. 2, which consists of four main stages.
1) Instance segmentation is performed on the image data. We utilize a 2-D detector [35] to obtain instance segmentation masks, including pixelwise class labels, for the multiview images.
2) The density of the point cloud is augmented (P-DA). Analogous to MVP [34], we project the original point cloud onto the instance segmentation masks and generate new virtual points based on the class and coordinates of the mask associated with each point. In the proposed approach, we obtain depth information for the virtual points using a more general global N-nearest neighbor data clustering method.
3) Point-cloud semantic augmentation (P-SA) is performed by applying instance segmentation and Gaussian masks to decorate the original point cloud and enrich its features.
4) The augmented point cloud is input to CenterPoint [7] to complete the 3-D object detection procedure.

A. Point-Cloud Density Augmentation
Existing LiDAR point-cloud data are not always complete and satisfactory due to the distance to objects in the scene, self-occlusion, and limited sensor resolution. Therefore, recovering complete point clouds from partially sparse raw data is crucial, especially by increasing the density of object point clouds. The recent MVP [34] method utilized image prediction results to generate dense 3-D virtual points for the first time in large-scale outdoor scenes, which inspired us to further explore high-quality point-cloud density augmentation. As shown in Fig. 3, we first feed the image into a 2-D detector [35] (note that although the 2-D detector is capable of both object detection and instance segmentation, only the instance segmentation result is used in P-DA). The original point cloud is then projected onto the instance segmentation mask (the specific projection steps are described in detail in Section III-C). We then generate a fixed number of random points on the instance segmentation mask using random sampling and correlate the randomly sampled points $s_i$ with the projected points $p_j$ using the global N-nearest neighbors (GNNN) method to obtain the depth $d_i$ of each virtual point (a virtual point is a randomly sampled point with an assigned depth). Finally, the new points are reverse-mapped to the original point-cloud space to obtain the final high-density virtual point cloud.
1) Global N-Nearest Neighbor Data Association: Here, we describe the process used to generate virtual points in detail. In MVP [34], only a local nearest-neighbor data association is used: the depth of the raw point corresponding to the nearest projected point is taken as the depth feature of the virtual point, without considering the position information of the surrounding points. As a result, the virtual points cannot truly reflect the shape of the object. Therefore, we propose a global N-nearest neighbor data association method designed to generate a more accurate virtual-point depth $d_i$. As shown in Fig. 3, the coordinates of a virtual point on the instance mask are those of the random sampling point $s_i$. Centered on this coordinate, all projected points $p_j$ (white dots in the figure) within the object mask participate in the association of the virtual-point depth. We assign different weights to the depths $d_j$ (i.e., the depths of the raw point cloud) of the projected points according to their distances: projected points closer to the random sampling point receive larger weights, and vice versa. Specifically, we first use (1) to calculate the Euclidean distance $y_j$ between the random sampling point $s_i$ and each projected point $p_j$:

$$y_j = \left\| s_i - p_j \right\|_2 \tag{1}$$

The corresponding weight $w_j$ at each distance is then obtained using the inverse-distance weighted interpolation in (2), where $\alpha$ is a positive constant with a default value of 2:

$$w_j = \frac{y_j^{-\alpha}}{\sum_{k} y_k^{-\alpha}} \tag{2}$$

Finally, the depth $d_i$ of the random sampling point, which incorporates global information, is obtained by the weighted sum in (3) and then inversely mapped to the 3-D point-cloud space to obtain the final virtual point cloud:

$$d_i = \sum_{j} w_j d_j \tag{3}$$
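The inverse-distance weighting in (1)-(3) can be sketched as follows; this is our own illustration of the described association rather than the authors' implementation, and the array names simply mirror the notation above ($s_i$, $p_j$, $d_j$, $\alpha$).

```python
import numpy as np

def virtual_point_depths(sample_uv, proj_uv, proj_depth, alpha=2.0, eps=1e-6):
    """Estimate depths for randomly sampled mask points, following Eqs. (1)-(3).

    sample_uv:  (S, 2) pixel coordinates s_i of random samples on one instance mask
    proj_uv:    (P, 2) pixel coordinates p_j of LiDAR points projected onto the mask
    proj_depth: (P,)   depths d_j of those projected points
    """
    # (1) Euclidean distance y_j between every sample s_i and every projection p_j.
    dists = np.linalg.norm(sample_uv[:, None, :] - proj_uv[None, :, :], axis=-1)  # (S, P)

    # (2) Inverse-distance weights: closer projected points receive larger weights.
    weights = 1.0 / (dists + eps) ** alpha
    weights /= weights.sum(axis=1, keepdims=True)

    # (3) Weighted sum of the real depths gives each virtual-point depth d_i.
    return weights @ proj_depth                                                   # (S,)

# Toy example: three projected LiDAR points and two sampled mask points.
proj_uv = np.array([[100.0, 50.0], [104.0, 52.0], [120.0, 60.0]])
proj_depth = np.array([20.0, 21.0, 25.0])
sample_uv = np.array([[102.0, 51.0], [118.0, 59.0]])
print(virtual_point_depths(sample_uv, proj_uv, proj_depth))
```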

B. Point-Cloud Semantic Augmentation
The density-augmented point cloud compensates for the sparseness of the raw data and provides 3-D object location and geometric shape information. However, because point clouds do not include color and texture information, extracting deeper object features remains the main challenge. To fully utilize the semantic information obtained from images, we propose a point-cloud semantic augmentation method (P-SA), as shown in Fig. 4. We use two methods to fuse images with point-cloud data and enrich the input features of the point cloud, namely, category semantic augmentation and Gaussian semantic augmentation.
1) Category Semantic Augmentation: PointPainting [33] proposed decorating a raw point cloud with the class and score of an image segmentation mask, an idea also adopted by a later 3-D detection model [43]. We include this painting step as part of P-SA; however, we utilize object-oriented instance masks instead of the global segmentation masks used in the original paper [33]. In particular, as shown in Fig. 4, we first feed the image into our pretrained Mask-RCNN [35] instance segmentation network to obtain the class label mask of the object pixels and encode it with one-hot encoding. The point cloud is then projected onto the corresponding instance segmentation mask, and the category labels $c_i$ of the relevant pixels are attached to the point cloud to obtain a semantically augmented point cloud. Finally, the augmented projected points are inversely mapped back to the point-cloud space.

2) Gaussian Semantic Augmentation: Before introducing the proposed method, we describe the baseline 3-D detector [7] considered in this work. Unlike anchor-based detection methods [1], [2], [40], [41], CenterPoint [7] is an anchor-free, centerpoint-based 3-D object detector [23], [37], [38], [39]. Because objects in 3-D space do not have a fixed orientation, anchor-based detectors have difficulty enumerating all orientations or fitting axis-aligned bounding boxes to rotated objects. To address this problem, CenterPoint [7] detects objects via their center points. Specifically, a standard 3-D backbone first extracts map-view (BEV) features from the LiDAR point cloud and generates heatmap peaks at the centers of detected objects; a detection head based on a 2-D CNN architecture then captures the object centerpoints (Fig. 5). Each centerpoint carries the subvoxel position refinement $o$, the height $h_g$ above the ground, the size $s$ of the 3-D bounding box, the yaw rotation angle $\alpha$, and other attributes. The centerpoint features are then used to regress the complete 3-D bounding box, including its size, orientation, and velocity. Point features are extracted from the 3-D centers of each face of the estimated bounding box, and the result is fed into a multilayer perceptron (MLP) to predict the final confidence score and refined bounding box.
For CenterPoint [7], not all points of an object are equally important: the features around the object centerpoint matter most. Therefore, we further explored how to efficiently use image information to enhance the features of the 3-D centerpoints. In particular, as shown in Fig. 4, we constructed a Gaussian mask based on the 2-D object detection results and, continuing the category semantic augmentation procedure described above, connected the Gaussian semantics of the object in the image with the original point cloud in 3-D space. The construction of the Gaussian mask is crucial because it maps the region around the object center in the image onto the corresponding point-cloud object. We used the 2-D Gaussian kernel in (4), centered on the 2-D bounding box, to compute the weight assigned to each projected point, so that points near the object center receive larger values.
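The sketch below illustrates the two P-SA decorations on one camera view: a one-hot class label taken from the instance mask and a Gaussian center weight derived from the 2-D bounding box. It is a simplified, hypothetical rendering of the procedure described above; in particular, the spread parameter `sigma_scale` and the feature layout are our own assumptions, not the authors' exact parameterization, and the points are assumed to have already been projected into pixel coordinates.

```python
import numpy as np

NUM_CLASSES = 10  # nuScenes detection classes

def paint_points(points_uv, inst_mask, inst_labels, boxes_2d, sigma_scale=6.0):
    """Append one-hot class labels and a Gaussian center weight to projected points.

    points_uv:   (N, 2) integer pixel coordinates of projected LiDAR points
    inst_mask:   (H, W) instance-id map, 0 = background, k = k-th instance
    inst_labels: (K,)   integer class label of each instance
    boxes_2d:    (K, 4) 2-D boxes (x1, y1, x2, y2) of the instances
    """
    n = points_uv.shape[0]
    onehot = np.zeros((n, NUM_CLASSES))
    gauss = np.zeros((n, 1))

    u, v = points_uv[:, 0], points_uv[:, 1]
    inst_id = inst_mask[v, u]                 # instance hit by each projected point
    for k in range(1, len(inst_labels) + 1):
        hit = inst_id == k                    # background points (id 0) keep zero features
        if not hit.any():
            continue
        onehot[hit, inst_labels[k - 1]] = 1.0
        x1, y1, x2, y2 = boxes_2d[k - 1]
        cu, cv = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        sigma = max(x2 - x1, y2 - y1, 1e-6) / sigma_scale   # assumed box-size-dependent spread
        gauss[hit, 0] = np.exp(-((u[hit] - cu) ** 2 + (v[hit] - cv) ** 2) / (2 * sigma ** 2))

    # Extra per-point features to concatenate with the original point attributes.
    return np.concatenate([onehot, gauss], axis=1)
```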

C. Coordinate Transformation
In this section, we introduce the key coordinate transformation in multimodal fusion, that is, the transformation from the LiDAR coordinate system to the image coordinate system, which aligns the point-cloud and pixel data. The representations and coordinate transformations of LiDAR point clouds differ between the nuScenes [14] and KITTI [36] datasets, which are commonly used for 3-D object detection. Points are represented as (x, y, z, r, t) in nuScenes [14] and (x, y, z, r) in KITTI [36], where (x, y, z) are the position coordinates of the point in 3-D space, r is the LiDAR reflectivity, and t is a timestamp unique to the nuScenes dataset.
On the nuScenes [14] dataset, our method projects the point cloud onto the image to obtain the corresponding projected point coordinates. The coordinate transformation is given as follows:

$$p_{\mathrm{camera}} = T_{(\mathrm{car} \to \mathrm{camera})}\, T_{(t_2 \to t_1)}\, T_{(\mathrm{lidar} \to \mathrm{car})}\, p_{\mathrm{lidar}}$$

where $t_1$ and $t_2$ represent the capture times of the camera and LiDAR frames, respectively; $T_{(\mathrm{lidar} \to \mathrm{car})}$ is the coordinate transformation from the LiDAR to the ego-vehicle reference frame; $T_{(t_2 \to t_1)}$ is the ego-motion transformation from the LiDAR capture time to the camera capture time; and $T_{(\mathrm{car} \to \mathrm{camera})}$ is the transformation from the vehicle coordinate system to the camera coordinate system.
Finally, the point cloud is projected onto the image using the camera's intrinsic parameter matrix $M_1$. Compared to the nuScenes [14] dataset, the coordinate transformation for the KITTI [36] dataset is relatively simple, mainly because the camera and LiDAR data in the nuScenes dataset were collected at different frequencies.
In the KITTI [36] dataset, only the LiDAR points must be converted to the camera coordinate system. The projection is given as follows:

$$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \sim M_1 M_2 \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}, \qquad M_2 = \begin{pmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{pmatrix}$$

where $(u, v)$ are the image coordinates of the projected LiDAR point, $M_1$ is the intrinsic 3 × 4 projection matrix obtained from camera calibration, $M_2$ is the extrinsic parameter matrix from LiDAR to camera, $R$ is the rotation matrix between the camera and LiDAR, and $t$ is the translation vector between the camera and LiDAR. The camera and point cloud are thus transformed into the same coordinate system, which is convenient for multimodal fusion processing.
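As a concrete illustration of this KITTI-style projection, the following sketch composes an assumed 3 × 4 intrinsic projection matrix $M_1$ with the extrinsic matrix $M_2$ built from $R$ and $t$; the calibration values shown are placeholders, not actual KITTI calibration data.

```python
import numpy as np

def lidar_to_image(points_xyz, M1, R, t):
    """Project LiDAR points into the image plane: [u, v, 1]^T ~ M1 * M2 * [x, y, z, 1]^T."""
    M2 = np.eye(4)
    M2[:3, :3] = R            # rotation from LiDAR to camera
    M2[:3, 3] = t             # translation from LiDAR to camera
    pts_h = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # homogeneous coords
    proj = (M1 @ M2 @ pts_h.T).T                                        # (N, 3)
    return proj[:, :2] / proj[:, 2:3]                                   # perspective divide -> (u, v)

# Placeholder calibration (illustrative values only).
M1 = np.array([[700.0, 0.0, 620.0, 0.0],
               [0.0, 700.0, 190.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]])
R = np.array([[0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0],
              [1.0, 0.0, 0.0]])        # LiDAR x-forward mapped to camera z-forward
t = np.array([0.0, -0.08, -0.27])
print(lidar_to_image(np.array([[10.0, 1.0, -0.5]]), M1, R, t))
```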

IV. EXPERIMENTS AND RESULTS
We evaluated the proposed D-S augmentation method on the nuScenes [14] and KITTI [36] datasets and conducted an extensive ablation study to verify the effectiveness of our proposed approach.

A. Datasets
The nuScenes [14] dataset is the first large-scale multiscene 3-D object detection dataset to provide a full sensor suite for autonomous vehicles, with more than 1000 scenes collected in Singapore and Boston. The data were collected by six multiview cameras, a 32-beam LiDAR, five millimeter-wave radars, a GPS device, and an IMU. The dataset provides 360° object annotations for the ten classes of interest. Compared with the KITTI [36] dataset, it contains more than seven times as many object annotations. The dataset consists of a training set with 28 130 frames, a validation set with 6019 frames, and a test set with 6008 frames. We followed the official division of the dataset [14] and used 28 130, 6019, and 6008 frames for training, validation, and testing, respectively.
The KITTI [36] dataset is also a commonly used public dataset for autonomous driving scenarios; its acquisition platform consists of two grayscale cameras, two color cameras, one LiDAR, four optical lenses, and a GPS navigation system. A total of 14 999 data samples are included in the 3-D object detection dataset, of which 7481 are allocated to the training set and the remainder to the test set. We divided the original training set into training and validation sets at a ratio of approximately 1:1. Following the official protocol of the KITTI [36] dataset, objects are divided into three difficulty levels according to the degree of occlusion and truncation: easy, moderate, and hard. For the car, pedestrian, and cyclist classes, we set intersection-over-union (IoU) overlap thresholds of 0.7, 0.5, and 0.5, respectively.

B. 2-D Detector
Our implementation is based on the open-source Mask-RCNN 2-D detection code (https://github.com/matterport/Mask_RCNN), which can perform both object detection and instance segmentation. Unlike semantic segmentation, instance segmentation identifies object contours at the pixel level for each individual object. We pretrained and fine-tuned a Mask-RCNN model [35] on the nuImages [14], KITTI [36], and Cityscapes [45] datasets. Because only 512 images were available in the KITTI dataset for training image segmentation, we pretrained on the Cityscapes dataset [45], which is commonly used for instance segmentation, and then fine-tuned on the KITTI instance segmentation set [36]. During training, we used the Adam optimizer with a batch size of 16, a learning rate of 0.02, and 9000 iterations.

C. 3-D Detector Details
Our 3-D detector uses CenterPoint [7] with the voxelization-based PointPillar [3] approach as its backbone. For the voxelization process, we set the detection range of the X- and Y-axes to [-51.2 m, 51.2 m]. CenterPoint [7] was trained for 20 epochs on four RTX 3070 Ti GPUs using the Adam optimizer and a one-cycle learning rate policy [46]. The mini-batch size was set to 16, the maximum learning rate to 3e-3, the division factor to 10, the momentum range to 0.95-0.85, and the weight decay to 1e-2. DS sampling [5] was adopted to alleviate the class imbalance problem in the nuScenes [14] dataset. During inference, we generated 50 virtual points for each 2-D object in the scene.
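For reference, the training settings above can be collected into a single configuration dictionary, as sketched below; the key names are our own, and only the values stated in the text are taken from the paper.

```python
# Assumed consolidation of the 3-D detector training settings (illustrative key names).
train_cfg = {
    "detector": "CenterPoint",
    "backbone": "PointPillars",               # voxelization-based backbone
    "point_cloud_range_xy_m": [-51.2, 51.2],  # X- and Y-axis detection range
    "epochs": 20,
    "gpus": 4,                                # RTX 3070 Ti
    "optimizer": "Adam",
    "lr_policy": "one-cycle",
    "batch_size": 16,
    "max_lr": 3e-3,
    "div_factor": 10,
    "momentum_range": (0.95, 0.85),
    "weight_decay": 1e-2,
    "virtual_points_per_object": 50,          # generated at inference time
}
print(train_cfg["max_lr"])
```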

D. Evaluation Metrics
The evaluation metrics of nuScenes are the mean average precision (mAP) and the nuScenes detection score (NDS) [14]. The mAP is obtained by averaging the per-class average precision over the BEV center-distance matching thresholds D = {0.5, 1, 2, 4} m and the set of classes C. The NDS is a weighted combination of the mAP and five true-positive error metrics, namely, the average translation error (ATE), average scale error (ASE), average orientation error (AOE), average velocity error (AVE), and average attribute error (AAE), as given in the following:

$$\mathrm{mAP} = \frac{1}{|C||D|} \sum_{c \in C} \sum_{d \in D} \mathrm{AP}_{c,d}$$

$$\mathrm{NDS} = \frac{1}{10} \left[ 5\,\mathrm{mAP} + \sum_{\mathrm{mTP} \in TP_c} \bigl( 1 - \min(1, \mathrm{mTP}) \bigr) \right]$$

where mAP is the mean average precision, $\mathrm{AP}_{c,d}$ is the average precision for class c at matching threshold d, and $TP_c$ is the set of five mean error metrics. Therefore, the NDS not only reflects detection performance but also quantifies detection quality in terms of box position, size, orientation, attributes, and velocity, comprehensively evaluating 3-D object detection results.
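A small sketch of how mAP and NDS are computed from the per-class, per-threshold AP values and the five mean true-positive error metrics is given below, following the nuScenes definition expressed by the equations above; it is an illustration, not the official nuScenes devkit code.

```python
import numpy as np

def nuscenes_mAP(ap):
    """ap: array of shape (num_classes, num_distance_thresholds) with AP_{c,d} values."""
    return float(np.mean(ap))

def nuscenes_NDS(mAP, tp_errors):
    """tp_errors: the five mean TP error metrics (mATE, mASE, mAOE, mAVE, mAAE)."""
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * mAP + sum(tp_scores)) / 10.0

# Toy example: 10 classes x 4 distance thresholds {0.5, 1, 2, 4} m, illustrative values only.
ap = np.full((10, 4), 0.713)
print(nuscenes_NDS(nuscenes_mAP(ap), tp_errors=[0.30, 0.25, 0.35, 0.28, 0.19]))
```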

E. nuScenes Results
We mainly evaluated our D-S augmentation method on the nuScenes validation set to validate the performance improvements obtained by the proposed point-cloud density and semantic augmentation methods. We compared the proposed D-S augmentation with the baseline CenterPoint algorithm [7], as well as PointPillar [3], PMPNet [47], PointPainting [33], CVCNet [48], HotSpotNet [39], and MVP [34], under the same settings. Table I lists the per-class AP for the ten classes, together with the mAP and NDS, of each 3-D object detection network. As may be observed from Table I, our D-S augmentation method achieved 71.3% mAP and 75.3% NDS on the nuScenes dataset, improving on the baseline CenterPoint network [7] by 7.9% in terms of mAP and 5.1% in terms of NDS. Our method showed performance gains for all object classes.

F. KITTI Results
To evaluate the generality of our method, we also conducted comparative experiments on the KITTI dataset, which is commonly used for 3-D object detection. First, we used the trained Mask-RCNN [35] model to mask instances of 3-D objects in the scene and to generate 50 virtual points for each object in each scene. We then assigned the class semantic features to the point cloud after instance masking by the 2-D detector Mask-RCNN [35]. Finally, a point cloud with augmented density and semantics was input to the 3-D detector. Table II presents the detection results of our method for the car, pedestrian, and cyclist categories. The performance of our method was slightly lower than that of the SE-SSD model [32]. However, our method outperformed the other methods for the moderate-difficulty car class. For the challenging pedestrian and cyclist detection tasks, our method achieved mAP values of 61.61% and 69.19%, respectively, under the 3-D AP criterion, and it achieved competitive performance compared to the other methods overall. The experimental results show that our method also performed well on the KITTI dataset [36].

V. ABLATION STUDIES
To verify the effectiveness of the proposed D-S augmentation method, we designed ablation experiments for the different network modules. All ablation experiments were based on the CenterPoint model [7] and were evaluated on the nuScenes [14] validation set. The results are shown in Table III.

[Table I: Performance comparison on the nuScenes validation set. We show the NDS and the mAP for each class; the highlighted regions represent the results based on our method. Table II: Comparison ...]

A. Point-Cloud Density Augmentation

1) Quantitative Analysis: As shown in Table III, we used CenterPoint [7] as the baseline network to evaluate the performance of the P-DA method for 3-D object detection on the nuScenes dataset [14]. The P-DA method achieved an mAP of 69.4% and an NDS of 73.5%, improvements of 6.0% and 3.3%, respectively, over the baseline CenterPoint network [7]. These results confirm that the P-DA method can improve 3-D object detection performance while increasing the required computational resources by only a relatively small amount.
2) Qualitative Analysis: Here, we provide visualizations of the virtual points generated by our method and describe the results of qualitative comparisons with the original point cloud. We visualized four scenarios, and the proposed approach performed well in each case. Specifically, in Fig. 6(a)-(c), due to occlusion between the main objects and smaller objects, the objects were very sparse in the original point cloud, and some were even represented by only a single point; such objects could not be distinguished, and detection was relatively difficult. Our method complements the sparse objects in the original point cloud without altering the original points themselves, which effectively reduces false and missed detections of 3-D objects, especially for small or sparse objects at longer distances, while improving the accuracy of 3-D object detection without sacrificing real-time performance. We also chose a nighttime scene to verify the robustness of P-DA: as shown in Fig. 6(d), our P-DA method still performed well despite the poor lighting.

B. Point-Cloud Semantic Augmentation
In this section, we analyze the performance of P-SA in detail.
1) Quantitative Analysis: As may be observed from Table III, adding our proposed P-SA method to the original CenterPoint [7] method improved the 3-D object detection accuracy by 6.2% and 3.8% in terms of the mAP and NDS evaluation metrics, respectively. The P-SA method helps the network understand the point-cloud data better by fusing the category semantic information of the image into the point cloud. Moreover, the proposed Gaussian semantic augmentation not only enhances the central features of the object but also reduces the likelihood of false and missed detections, improving overall performance. Compared with CenterPoint [7], the inference time increased by only 3 ms after adding the P-SA method, and the model still exhibited good real-time performance.
2) Qualitative Analysis: We visualized the qualitative results of the P-SA method, including both category semantic and Gaussian semantic augmentation. As shown in Fig. 7, we selected images from three camera angles in the nuScenes [14] dataset: front left, front, and front right. As shown in Fig. 7(a) and (b), the method not only segmented the semantic information of the image well but also accurately painted it onto the point cloud to enhance the semantic features of the object points. Fig. 8 shows the results of Gaussian semantic augmentation; the fusion process is similar to that of category semantic augmentation. As may be observed from Fig. 8(a) and (b), our method effectively assigned the central-region features of the 2-D objects to the point cloud, which improved the subsequent performance of the center-based 3-D detector.

C. D-S Augmentation
This section analyzes the overall performance of our D-S augmentation method.

1) Quantitative Analysis: As shown in Table III, we added the P-DA and P-SA methods together with the 3-D detector CenterPoint [7] to construct our overall D-S augmentation framework, which achieved 71.3% mAP and 75.3% NDS on the nuScenes [14] dataset. Compared to the LiDAR-only CenterPoint [7] method, our D-S augmentation achieved improvements of 7.9% and 5.1% in terms of mAP and NDS, respectively, and it outperformed most fusion-based methods on the nuScenes [14] benchmark. To verify the effectiveness of the P-DA and P-SA methods at different distances, we divided the KITTI validation set [36] into ranges of 0-30 m and 30-50 m; the results are shown in Fig. 9(a). On the moderate-difficulty validation set, for the 3-D AP of car detection, P-SA significantly improved performance in the 0-30 m range by 6.54%, because nearby point clouds can obtain more detailed image semantics. For long-range objects in the 30-50 m range, P-DA played the key role, yielding a 7.21% improvement. To further verify the effectiveness of our method, we also compared the AP of BEV detection. As may be observed from Fig. 9(b), D-S augmentation achieved the best accuracy of 65.19% in the 30-50 m range. On the one hand, this benefited from the density enhancement that P-DA provides for the relatively sparse distant point cloud; on the other hand, the P-SA method effectively fused the semantic information of the image into the point cloud. Thus, our approach achieved good performance by using the two modalities to complement each other.

Fig. 10: Illustration of the qualitative analysis of D-S augmentation on the nuScenes dataset [14]. On the left is the detection result of CenterPoint [7]; on the right is our method. Detected objects are denoted by green bounding boxes; object points in the point cloud are denoted in red; orange and purple bounding boxes denote the differences between the detection results of CenterPoint [7] and our method.
2) Qualitative Analysis: To highlight the effectiveness of our method, we qualitatively analyzed the overall network and visualized the 3-D object detection results, comparing them with the detection visualizations of CenterPoint [7] in six scenarios (Fig. 10). In Fig. 10(a), (b), and (e), most objects in the scene were correctly detected by CenterPoint [7], but the objects in the orange bounding boxes were missed; our method detected these missed objects. Fig. 10(c) shows an example of false detection by CenterPoint [7]: the obstacle in the orange bounding box was falsely detected as an object, whereas our method avoided this false detection. In Fig. 10(d), we performed 3-D object detection on small objects in the scene to verify the effectiveness of our method on different types of objects. Because small objects are extremely sparse in point clouds, they are relatively difficult to detect, and a large number of small objects were missed by CenterPoint [7]. Our method increased the density of the point cloud through point-cloud density augmentation (P-DA) and, at the same time, drew the semantic information of the image onto the point cloud through semantic augmentation (P-SA), improving the accuracy of small-object detection. Fig. 10(f) shows 3-D object detection in a complex scene. As may be observed from the orange bounding boxes, CenterPoint [7] exhibited both missed and false detections in this scene. Although our method avoided the false detections, it still missed some individual objects. Therefore, we plan to study methods to further mitigate missed and false detections and improve accuracy in complex, realistic environments.

VI. CONCLUSION
In this article, we proposed a multimodal sensor fusion method, called D-S augmentation, to augment the density and semantic information of point-cloud data for 3-D object detection. First, we performed object detection and instance segmentation on the image data and projected the point cloud onto the instance segmentation masks to generate a fixed number of random points. Second, we used the global N-nearest neighbor data association method to associate the random points with the projected points, providing the random points with depths and completing the density augmentation. Third, we associated the instance segmentation masks with the point cloud, assigned class labels and segmentation scores to the projected points, and inversely mapped the projected points with these 1-D features back to the point-cloud space to obtain a semantically augmented point cloud. Fourth, we encoded the point cloud with a Gaussian mask derived from the 2-D detection box to further augment the centerpoint features. Finally, we presented extensive experiments verifying the effectiveness of the proposed approach. The experimental results demonstrate that our D-S augmentation method significantly improved detection performance compared to previous state-of-the-art methods on the nuScenes and KITTI datasets. However, considerable room for improvement remains in terms of real-time detection, which provides a useful direction for further optimization in future work.