1 Introduction

Every day, almost 3700 people are killed globally in crashes involving cars, buses, motorcycles or pedestrians. More than half of those killed are pedestrians, motorcyclists, or cyclists. Crash injuries are estimated to be the eighth leading cause of death globally across all age groups and the leading cause of death for children and young people. According to the U.S. Department of Transportation, more than 90% of car crashes are attributed to driver error. The adoption of autonomous vehicles is expected to improve driving safety and traffic efficiency, so an accurate environment perception system is essential to reducing traffic accidents.

Intelligent driving is one of the most important development directions for smart cars and intelligent transportation systems. A safe and reliable self-driving car must build a 3D model of the surrounding objects; in other words, it needs the ability to perceive real driving scenes. An intelligent vehicle equipped with monocular cameras, stereo cameras, or LiDARs can acquire various types of information, and the rapid development of these sensors has contributed greatly to intelligent driving. Building on 2D object detection algorithms, 3D object detection algorithms have also been developing rapidly. In the past two years in particular, several outstanding 3D detection works that depart from traditional algorithm structures have achieved good performance. The improvement of 3D detection algorithms, combined with the intelligent upgrading of traditional cars, brings intelligent driving closer to everyday life. The comparison between 2D and 3D object detection is shown in Fig. 1.

This paper is structured as follows. Section 2 describes commonly used sensors and datasets for perception in intelligent driving. Section 3 lists representative frameworks for 3D object detection in autonomous driving. The results of different 3D object detection algorithms are compared and summarized in Sect. 4. Section 5 provides a brief conclusion on 3D object detection algorithms.

Fig. 1 The comparison between 2D detection and 3D detection

2 The hardware preparations for intelligent driving

2.1 The technologies demanded by intelligent driving

Self-driving technology involves many technical areas. A mature self-driving car needs different kinds of sensors, high-precision maps, the Internet of Vehicles (IoV), and high-performance chips, as shown in Fig. 2.

Fig. 2 The technologies demanded by intelligent vehicles

Self-driving cars are one of the most discussed technologies in modern society. The development of autonomous driving is commonly divided into six levels. Between 2020 and 2030, manufacturers and research institutes are pushing the development of self-driving from Level 3 to Level 4.

Level 0: The driver does all the driving: steering, braking, accelerating, and so on.

Level 1: These cars can handle one task at a time, like automatic braking.

Level 2: These cars would have at least two automated functions.

Level 3: Drivers must still be ready to intervene when necessary, but can otherwise hand all driving functions over to the vehicle.

Level 4: These cars are officially driver-free in certain environments.

Level 5: A fully autonomous system is expected to perform as well as a human driver, coping with various unconstrained driving scenarios.

2.2 The current manufacturers of intelligent vehicles

Autonomous driving is one of the most important applications of artificial intelligence technology in everyday life. During the COVID-19 pandemic in particular, contact-free robotic applications have become especially important. Many car manufacturers have made good progress in autonomous driving technology.

In 2020, Waymo made fully driverless taxis commercially available in Arizona, US. At present, Waymo is the only platform transporting passengers in fully autonomous vehicles, and it is widely seen as an industry leader.

Tesla has ditched expensive LiDAR and instead developed Tesla Vision, which relies on a combination of cameras. This makes self-driving cars cheaper and enables more consumers to experience the latest self-driving technology.

Honda Research Institute plans to launch an L3 autonomous vehicle in 2022 and an L4 autonomous vehicle in 2025.

BMW plans to produce L4 autonomous vehicles by 2024, capable of driving themselves in certain geographical locations, such as highways or dual carriageways.

2.3 3D object detection sensors

In autonomous driving, the detection sensors mounted on smart cars are the most important devices, providing accurate and real-time information for intelligent driving systems. Many categories of detection sensors can be used in 3D object detection algorithms; the most commonly used ones are listed in Fig. 3. Sensors can be broadly divided into three main categories according to the kind of information they capture.

First, visual sensors, such as monocular and stereo cameras, capture color information and pixel-level images; depth can also be computed by triangulation. Second, LiDARs and radars directly measure depth and produce point cloud maps of the 3D scene, but the points in these maps are irregularly arranged and very sparse, which complicates subsequent neural network computation. The third category fuses the previous two: fused sensor suites provide more comprehensive data, including pixel-level images and point cloud maps. In principle, the more information the sensors provide, the higher the detection accuracy a detector can achieve.
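As a concrete illustration of the triangulation mentioned above, the following minimal sketch converts a disparity map into a depth map using the standard pinhole stereo relation Z = fB/d. The focal length and baseline are illustrative placeholders (roughly KITTI-like values), not parameters of any specific sensor discussed in this survey.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (in pixels) to a depth map (in meters).

    Standard pinhole stereo triangulation: Z = f * B / d.
    Pixels with zero disparity are left at depth 0 (invalid).
    """
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical calibration, roughly KITTI-like (f ~ 721 px, B ~ 0.54 m).
disparity = np.random.uniform(1.0, 60.0, size=(375, 1242)).astype(np.float32)
depth = disparity_to_depth(disparity, focal_px=721.0, baseline_m=0.54)
print(depth.min(), depth.max())
```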

Fig. 3 The comparison of different sensors

2.4 3D object detection datasets

Since 2012, an increasing number of datasets have been created to train and evaluate object detection frameworks. In 3D object detection research, datasets can be divided into indoor and outdoor datasets according to the application scene. This paper surveys 3D object detection frameworks mainly used in intelligent driving scenes, so this section highlights several widely used outdoor datasets, especially autonomous driving datasets, as shown in Table 1.

The well-known 3D object detection datasets listed here tend to use RGB cameras and LiDARs to collect different kinds of data. These fusion datasets provide not only color, pixel-level images but also point cloud maps with depth information. All of the datasets contain many different self-driving scenes, multiple object classes, and a large number of 3D annotated bounding boxes, ensuring adequate training and evaluation data for 3D object detectors. A few of the most commonly used datasets are introduced below.

Table 1 A summary of 3D object detection datasets for intelligent driving

2.4.1 KITTI

The KITTI dataset is one of the largest and most widely used computer vision evaluation datasets for autonomous driving scenes. It is used to evaluate computer vision technologies such as optical flow, visual odometry, 3D object detection, and 3D tracking in autonomous driving environments. KITTI contains real-world image data collected from urban, rural, and highway scenes. A single image in KITTI can contain up to 15 cars and 30 pedestrians. The original recordings are classified as Road, City, Residential, Campus, and Person. For 3D object detection, labels are subdivided into car, van, truck, pedestrian, cyclist, tram, and so on.
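For readers unfamiliar with the KITTI 3D annotations, the sketch below parses one label file into Python objects. It assumes the standard KITTI object label layout (class, truncation, occlusion, observation angle, 2D box, 3D dimensions, location, and yaw per line); the field names and the KittiObject container are our own illustrative choices, not part of the dataset distribution.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KittiObject:
    cls: str              # 'Car', 'Pedestrian', 'Cyclist', ...
    truncated: float      # 0 (fully visible) .. 1 (fully truncated)
    occluded: int         # 0..3 occlusion level
    alpha: float          # observation angle in [-pi, pi]
    bbox2d: List[float]   # left, top, right, bottom in image pixels
    dims: List[float]     # height, width, length in meters
    location: List[float] # x, y, z of the box bottom center in camera coordinates
    rotation_y: float     # yaw angle around the camera Y axis

def parse_kitti_label_file(path: str) -> List[KittiObject]:
    """Read one KITTI label .txt file (one object per line) into a list of objects."""
    objects = []
    with open(path) as f:
        for line in f:
            v = line.split()
            objects.append(KittiObject(
                cls=v[0],
                truncated=float(v[1]),
                occluded=int(float(v[2])),
                alpha=float(v[3]),
                bbox2d=[float(x) for x in v[4:8]],
                dims=[float(x) for x in v[8:11]],
                location=[float(x) for x in v[11:14]],
                rotation_y=float(v[14]),
            ))
    return objects
```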

2.4.2 ApolloScape

The ApolloScape dataset was provided by Baidu Inc. Its 3D LiDAR object detection and tracking subset consists of point clouds and high-quality labels, collected under various lighting conditions and traffic situations in Beijing. More specifically, ApolloScape contains very complex traffic flows mixed with vehicles, cyclists, and pedestrians.

2.4.3 H3D

H3D is a point cloud dataset for self-driving scenes provided by Honda. It is built on the HDD dataset, a large-scale naturalistic driving dataset collected in the San Francisco Bay Area. H3D provides full 360-degree LiDAR coverage with 3D bounding box labels. The dataset also contains video data, with annotations created manually at 2 Hz and propagated linearly to 10 Hz.

2.4.4 Waymo

Waymo is a self-driving car company owned by Alphabet Inc., Google's parent company. Waymo released the Waymo Open Dataset, a benchmark with prize-based challenges in contrast to previous purely academic benchmarks. The dataset contains about 3000 driving records and 600,000 frames, with approximately 25 million 3D bounding boxes and 22 million 2D bounding boxes, covering a huge amount of data across various autonomous driving scenes.

2.4.5 Lyft Level 5

The Lyft Level 5 dataset is as well known as the KITTI dataset. It was acquired with 64-beam LiDARs and multiple cameras, and includes over 55,000 human-labeled 3D annotated frames, surface maps, and an underlying HD spatial semantic map.

3 The classification of 3D detection frameworks

We divide 3D object detection methods into three categories: vision-based, point cloud-based, and multi-sensor fusion-based methods. An overview of the methodologies, advantages, and limitations of these methods is given below, and the following subsections address each category individually, as shown in Fig. 4.

Fig. 4 The classification of the 3D object detection frameworks

3.1 Vision based 3D detection framework

Vision sensors, especially monocular cameras, are the most important devices for 2D object detection, while stereo cameras and depth cameras are better suited to detecting 3D objects in the real world because of their ability to measure distance. Depth information can be stored as a fourth image channel that encodes the distance from the object surface to the viewpoint. A depth map is similar to a grayscale image, but each of its pixel values is the actual distance from the sensor to the object. Usually, the RGB image and the depth image have a pixel-to-pixel correspondence.

This subsection focuses on frameworks that estimate 3D bounding boxes from RGB images and depth information. The frameworks are divided into two groups according to the type of camera used.

3.1.1 Monocular camera frameworks

3D object detection from a single image without LiDAR is a challenging task due to the lack of accurate depth information. Traditional 2D convolution frameworks are unsuitable for this task because they cannot capture the three-dimensional structure and scale of objects, which are important for 3D object detection. To better represent 3D structure, Ding [1] transformed depth maps estimated from 2D images into a pseudo-LiDAR (PL) representation and then applied existing 3D object detectors to the resulting points. This method can predict point cloud information from an ordinary image captured by a monocular camera, but there is inevitably an error between the predicted point cloud and the real one.
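The core of the pseudo-LiDAR idea is back-projecting an estimated depth map into a 3D point cloud using the camera intrinsics. A minimal sketch of this conversion is given below, assuming a simple pinhole camera model; the intrinsic parameters would come from the camera calibration and are placeholders here.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a per-pixel depth map into a 3D point cloud.

    depth: (H, W) array of metric depths in the camera frame.
    fx, fy, cx, cy: pinhole intrinsics of the (hypothetical) camera.
    Returns an (N, 3) array of points in camera coordinates.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels
```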

3.1.2 Stereo camera frameworks

3D object detection using LiDAR has always been much more accurate than detection using cameras, but PL has drastically reduced the accuracy gap between methods based on LiDAR sensors and those based on cheap cameras. PL combines advanced deep neural networks for depth estimation with networks for 3D object detection, and performs well.

However, PL's two networks need to be trained separately. To obtain an end-to-end PL framework, Qian [2] introduced a framework based on differentiable Change of Representation (CoR) modules that allow PL to be trained end-to-end. The final framework, combined with Point R-CNN, achieves better results than PL and was the highest entry on the KITTI image-based 3D object detection leaderboard at submission time in 2020. Although stereo cameras can measure depth through the disparity map, many recent works still recover point clouds from disparity estimation and then apply a 3D detector originally designed for LiDAR data.

In 2020, Sun [3] proposed a framework called Disp R-CNN for 3D object detection from stereo images. To address the scarcity of disparity annotations for training, the authors proposed generating statistical pseudo ground truth without the need for LiDAR sensors, which greatly reduces cost in intelligent driving. Computing the disparity map for the entire image is computationally costly, so they designed an instance disparity estimation network (iDispNet) that predicts disparity only for pixels on objects of interest and learns a category-specific shape prior for more accurate disparity estimation. Disp R-CNN achieves competitive performance and outperforms previous state-of-the-art methods by 20% in average precision.

In the past several years, LiDAR-based 3D detection frameworks have performed much better than vision-based methods, but LiDAR-only frameworks cannot extract semantic information without the help of images. Chen [4] therefore proposed the Deep Stereo Geometry Network (DSGN), which significantly reduces this gap by detecting 3D objects on a differentiable volumetric representation, a 3D geometric volume that effectively encodes 3D geometric structure in regular 3D space. With this method, depth and semantic information can be acquired simultaneously. DSGN is a simple and effective one-stage stereo-based 3D detection pipeline that jointly estimates depth and detects 3D objects in an end-to-end learning manner. It outperforms previous stereo-based 3D detectors and even achieves performance comparable to several LiDAR-based methods on the KITTI 3D object detection leaderboard.

3.2 LiDAR based 3D detection framework

Currently, point cloud-based methods achieve state-of-the-art performance in 3D detection. Thanks to the accurate depth measurements in point cloud data, many frameworks that regress 3D bounding boxes directly in 3D achieve high detection accuracy. However, such 3D-based methods also carry a heavy computational burden, which becomes a challenge for real-time self-driving applications. Many researchers therefore use 2D-based detection algorithms to realize object detection in three dimensions. This subsection divides the LiDAR-based frameworks into two categories: 2D indirect methods and 3D direct methods.

3.2.1 2D indirect method frameworks

Martin et al. [5] used the YOLOv2 detection network to build a 3D object detection framework in 2018, called Complex-YOLO, which achieved an excellent detection speed suitable for real-time 3D detection in self-driving. The work transforms the 3D point cloud into bird's eye view (BEV) images encoding height, density, and intensity. The main idea is to convert the 3D point cloud data into a 2D image representation and then use 2D detectors to complete the 3D object detection.
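A minimal sketch of this kind of BEV rasterization is shown below: each LiDAR point falls into a grid cell, and the cell stores the maximum height, maximum intensity, and a normalized point density, in the spirit of Complex-YOLO. The grid range, cell size, and density normalization constant here are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def pointcloud_to_bev(points, x_range=(0, 50), y_range=(-25, 25), cell=0.1):
    """Rasterize a LiDAR point cloud (x, y, z, intensity) into a 3-channel BEV map.

    Channels: max height, max intensity, and log-normalized point density
    per grid cell. Ranges and cell size are illustrative values.
    """
    h = int((y_range[1] - y_range[0]) / cell)
    w = int((x_range[1] - x_range[0]) / cell)
    bev = np.zeros((3, h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)

    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    xi = ((pts[:, 0] - x_range[0]) / cell).astype(np.int32)
    yi = ((pts[:, 1] - y_range[0]) / cell).astype(np.int32)

    for ix, iy, z, inten in zip(xi, yi, pts[:, 2], pts[:, 3]):
        bev[0, iy, ix] = max(bev[0, iy, ix], z)      # height channel
        bev[1, iy, ix] = max(bev[1, iy, ix], inten)  # intensity channel
        counts[iy, ix] += 1
    bev[2] = np.minimum(1.0, np.log(counts + 1) / np.log(64))  # density channel
    return bev
```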

Li et al. [6] proposed a 2.5D object detection framework called 6DOF-YOLO in 2021. This work also transforms the 3D point cloud into BEV images, but unlike Complex-YOLO it performs 2.5D detection that regresses the height of vehicles, whereas Complex-YOLO actually performs 2D detection and then uses the point cloud to calculate 3D bounding boxes. Li's work also adopts the state-of-the-art YOLOv4 [7], striking a good balance between detection efficiency and accuracy.

3.2.2 3D direct method frameworks

In 2018, Shi et al. [8] proposed a two-stage 3D point cloud detection framework called PointRCNN. By segmenting the point cloud of the whole scene into foreground and background, the stage-one network directly generates a small number of high-quality 3D proposals from the point cloud. The stage-two network refines the proposals in canonical coordinates by combining semantic features and local spatial features. PointRCNN uses PointNet [9] to process the raw points, which brings a great improvement in accuracy, although its efficiency is not reported.

In 2019, Yang et al. [10] presented a new two-stage 3D object detection framework named the Sparse-to-Dense 3D Object Detector (STD). This two-stage network uses PointNet++ as the backbone for 3D detection. Its novelties are a new spherical anchor and a 3D intersection-over-union (IoU) criterion, which achieve high recall and localization accuracy. The method outperforms other state-of-the-art approaches, especially on the hard subset, but with an inference speed of only around 10 FPS.
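To make the 3D IoU criterion concrete, the sketch below computes the IoU of two axis-aligned 3D boxes. This is a simplified stand-in: detectors such as STD use a rotated 3D IoU that also accounts for box yaw, typically via a rotated polygon overlap in BEV multiplied by the height overlap.

```python
def axis_aligned_3d_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2).

    Simplified, axis-aligned illustration of the 3D IoU idea; it ignores yaw,
    which real 3D detectors handle with rotated-box overlap computations.
    """
    ix = max(0.0, min(box_a[3], box_b[3]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[4], box_b[4]) - max(box_a[1], box_b[1]))
    iz = max(0.0, min(box_a[5], box_b[5]) - max(box_a[2], box_b[2]))
    inter = ix * iy * iz
    vol_a = (box_a[3] - box_a[0]) * (box_a[4] - box_a[1]) * (box_a[5] - box_a[2])
    vol_b = (box_b[3] - box_b[0]) * (box_b[4] - box_b[1]) * (box_b[5] - box_b[2])
    return inter / (vol_a + vol_b - inter + 1e-9)

# Example: two unit cubes offset by half a unit along x.
print(axis_aligned_3d_iou((0, 0, 0, 1, 1, 1), (0.5, 0, 0, 1.5, 1, 1)))  # ~0.333
```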

Table 2 3D object detection benchmark on KITTI test set

3.3 Sensors fusion framework

To exploit the advantages of different sensors, fusion technology plays an important role in balancing their different characteristics. To date, multi-sensor fusion 3D detection methods fall into two kinds: single-stage fusion and multi-stage fusion.

3.3.1 Single-stage fusion

Single-stage fusion means that features from different sensors are fused only once, at the feature extraction layer or at the data preprocessing stage.

One representative work is PointPainting [11]. Instead of designing an end-to-end fusion network, it is divided into two stages: the first stage performs semantic segmentation on the camera data; the second stage performs 3D object detection by combining the LiDAR point cloud with the semantic information. The two stages are connected by feature projection.
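The feature projection step can be sketched as follows: LiDAR points are projected into the image plane, the per-pixel segmentation scores are gathered at those locations, and the scores are concatenated to the point coordinates. This is a minimal PointPainting-style illustration assuming a single camera and a combined 3-by-4 projection matrix; the actual implementation handles multiple cameras and calibration details.

```python
import numpy as np

def paint_points(points_xyz, seg_scores, proj_matrix):
    """Append per-pixel semantic scores to LiDAR points (PointPainting-style sketch).

    points_xyz: (N, 3) LiDAR points in the LiDAR frame.
    seg_scores: (H, W, C) per-pixel class scores from an image segmentation network.
    proj_matrix: (3, 4) matrix projecting homogeneous LiDAR points to image pixels
                 (assumed to combine extrinsics and camera intrinsics).
    Returns (M, 3 + C) painted points for those that project inside the image.
    """
    h, w, c = seg_scores.shape
    homo = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # (N, 4)
    uvw = homo @ proj_matrix.T                                          # (N, 3)
    valid = uvw[:, 2] > 0                        # keep points in front of the camera
    u = (uvw[valid, 0] / uvw[valid, 2]).astype(np.int32)
    v = (uvw[valid, 1] / uvw[valid, 2]).astype(np.int32)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    pts = points_xyz[valid][inside]
    scores = seg_scores[v[inside], u[inside]]                           # (M, C)
    return np.hstack([pts, scores])
```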

3.3.2 Multi-stage fusion

Multi-stage fusion means that features from different sensors are fused multiple times, at different feature extraction layers throughout the fusion algorithm.

ContFusion [12] is an example of a multi-stage fusion method. Compared with single-stage fusion, it is more complex, mainly in the extraction of fused features. The overall network is end-to-end and consists of four parts: camera feature extraction, feature fusion, LiDAR feature extraction, and detection output. The characteristic of this design is that the two kinds of information are fused at every feature extraction layer, so the detection accuracy can be very high.

Chen [13] proposed a two-stage 3D detection network called MV3D in 2017. MV3D takes point clouds and images as input, and the network can be divided into two stages: a 3D proposal network that generates regions of interest (ROIs), and a region-based fusion network that fuses the LiDAR bird's eye view, LiDAR front view, and image data. MV3D comprehensively analysed the impact of different fusion stages on detection results and ultimately adopted multi-stage fusion across the different layers of the fusion network.

4 Summary

As Table 2 shows, the frameworks discussed above are evaluated with the new evaluation metric on the KITTI leaderboard; several methods evaluated under the old protocol are not available on the leaderboard. First, bird's eye view (BEV) detection generally achieves higher average precision (AP) than full 3D detection. Second, frameworks using different sensors also differ widely in AP: vision-based frameworks generally obtain the lowest accuracy, while multi-sensor fusion frameworks achieve state-of-the-art performance. However, when sensor prices are taken into account, multi-sensor fusion frameworks are the most expensive of the three kinds of work, and camera-only setups are the least expensive. The best vision-based 3D detection frameworks have achieved results quite close to the best overall performance, so intelligent driving systems using visual sensors have great prospects for future development.

In February 2021, the California DMV (Department of Motor Vehicles) released autonomous driving data for 2020. The annual Miles Per Intervention (MPI) reported by the California DMV is one of the key measurements of autonomous driving, reflecting the average number of miles driven per disengagement each year. It is widely regarded by the industry as a more objective, quantitative, and accurate measure than driving experience tests. Based on an analysis of the different intelligent vehicle manufacturers and leading research and development institutions, the best testing results to date are listed in Table 3. As the table shows, among the top 10 results, Waymo's intelligent driving technology has the best performance.

Table 3 The testing results of California self-driving car disengagements

This paper also summarizes some shortcomings of the 3D object detection systems applied in autonomous driving. First, in practical driverless applications, occlusion and cluttered surroundings can cause detectors to miss target objects in traffic. Second, a great amount of data has to be processed in traffic, yet most algorithms cannot meet the real-time requirements of autonomous driving. Third, in unmanned driving, the real-time performance of 3D object detection is difficult to guarantee, and its accuracy is not yet high enough for safe self-driving.

The final goal of intelligent driving is for self-driving skill to exceed the average driving skill of humans. Intelligent driving therefore needs more accurate 3D object detection algorithms and faster processing chips to make self-driving cars safer and smoother. Frameworks using vision cameras are the most competitive methods because of their low cost and potential for higher detection accuracy. It is believed that camera-only self-driving cars can ensure safety, since human beings drive using their eyes alone. The most important requirement for self-driving is a smart brain, that is, a smart 3D detector.

5 Conclusion

This paper systematically surveys the development of 3D object detection methods applied to intelligent driving technology. It also discusses the different types of information acquired by different sensors, such as RGB images, point cloud maps, and fused multi-source information. After analyzing effective and representative 3D detection works in detail, the paper groups them into different types according to their underlying algorithms. Experimental results of the different kinds of 3D detection frameworks are discussed and compared on the KITTI benchmark, and the practical application of these detectors in real traffic environments is also analyzed, together with the relationship between achieved detection accuracy and the accuracy demanded for safe driving. Finally, the paper discusses the shortcomings of existing 3D object detection frameworks and future development directions for intelligent driving.