Effective Backbone Network for 3D Object Detection in Point Cloud

Three-dimensional (3D) object detection comprises object classification and object localization, and is used in many applications, such as autonomous driving and mobile robots. However, the accuracy of classification and localization is strongly affected by the depth of the network. Shallow networks tend to classify poorly, but as the depth of the network increases, network degradation becomes more pronounced, which hinders training. To solve this problem, a novel Backbone Network consisting of multiple residual modules is proposed in this paper for 3D object detection in point clouds. Experimental results on the KITTI 3D object detection benchmark show that the proposed Backbone Network effectively improves the accuracy of 3D object detection.


Introduction
Three-dimensional (3D) object detection is a challenging subject in the field of computer vision and is widely used in practical applications such as autonomous driving [1,2], video surveillance [3] and mobile robots [4]. It can be divided into two sub-tasks: object classification and object localization. Object classification determines the category of an object; object localization gives an accurate 3D bounding box of the object in a given scene. In recent years, deep learning, with its powerful feature-learning ability, has become an important research direction in computer vision and has made substantial progress in many fields, including 3D object detection. Sufficient depth plays a key role in the success of deep learning models across tasks. The main building block of a deep network is a standard non-linear transformation module, which consists of a convolution layer, a pooling layer and an activation layer. The deeper the model, the stronger its non-linear expressive power, allowing it to learn more complex transformations and fit more complex feature inputs [5]. Although deepening the network can improve the performance of the model to a certain extent, performance is not directly proportional to depth, for reasons of both performance and optimization. First, when the depth of the network reaches a certain level, performance tends to saturate and no longer increases with depth; in this case, further deepening only brings expensive time cost. Second, deepening the model may also degrade the learning ability of some shallow layers, which in turn limits the learning of the deeper layers. Last but not least, the problem of network degradation in deep networks always exists; although it can be alleviated, it cannot be eliminated [6].
Therefore, it is possible that performance begins to decline as the network is deepened.
IOP Conf. Series: Materials Science and Engineering 711 (2020) 012084, doi:10.1088/1757-899X/711/1/012084
The residual block can solve the problem of network degradation very well. Consequently, it has been applied to various tasks, such as object classification [6] and object detection [18].

Our Network
The proposed architecture is introduced in this section; its overall pipeline is shown in Fig. 2.
Data Preprocessing. To apply a CNN, the point cloud must be converted into regular data. First, we crop the point cloud to a fixed L × W × H m³ region along the X, Y and Z axes. Then the cropped point cloud is discretized into an evenly spaced grid in the x-y plane with cell size Dx × Dy, generating a total of L1 × W1 pillars, where L1 = L / Dx and W1 = W / Dy. In addition, to reduce the sampling deviation between pillars, we use random down-sampling to fix the number of points in each non-empty pillar to a number N.
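The preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the range, resolution and point-count defaults are assumptions chosen only for the example.

```python
import numpy as np

def pillarize(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
              z_range=(-3.0, 1.0), dx=0.16, dy=0.16, max_points=100, seed=0):
    """Crop an (M, 4) point cloud of (x, y, z, reflectance) rows to a fixed
    L x W x H region and group the surviving points into vertical pillars
    on the x-y grid. Range and resolution defaults are illustrative only."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    points = points[keep]

    # Integer pillar coordinates on the L1 x W1 grid (L1 = L/Dx, W1 = W/Dy).
    ix = ((points[:, 0] - x_range[0]) // dx).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) // dy).astype(np.int64)

    pillars = {}
    for p, key in zip(points, zip(ix.tolist(), iy.tolist())):
        pillars.setdefault(key, []).append(p)

    # Fix every non-empty pillar to exactly `max_points` points by random
    # down-sampling (sampling with replacement when the pillar is short).
    rng = np.random.default_rng(seed)
    for key, pts in pillars.items():
        pts = np.stack(pts)
        idx = rng.choice(len(pts), size=max_points,
                         replace=len(pts) < max_points)
        pillars[key] = pts[idx]
    return pillars
```

Only non-empty pillars are stored; empty cells of the L1 × W1 grid are handled later when the pseudo-image is assembled.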
Feature Learning Network. To obtain a more expressive feature, each point in a non-empty pillar is represented as a vector p_i = {x, y, z, r, x−Δx, y−Δy, z−Δz, x−x_c, y−y_c} (i ≤ N), where (x, y, z, r) are the coordinates and reflectance of the point itself, (Δx, Δy, Δz) is the mean coordinate of all points in the pillar, and (x_c, y_c) is the x-y center of the pillar. Then a simplified version of PointNet is applied in each non-empty pillar to learn a pillar-wise feature. In detail, the simplified PointNet consists of a linear layer, a Batch Normalization (BN) layer and a Rectified Linear Unit (ReLU) layer. After that, each non-empty pillar is represented by a 128-D vector, and empty pillars are filled with 128-D zero vectors. As a result, the point cloud can be represented by a pseudo-image of size L1 × W1 × 128 [2,10,22].
Backbone Network. Deepening the network is likely to cause network degradation. To address this problem, we propose a novel Backbone Network for 3D object detection in point clouds; its structure is shown in Fig. 3. It is a unidirectional feed-forward 2D CNN that computes a pyramidal 2D feature hierarchy. Every network layer consists of a convolution layer and a residual block; in particular, the structure and parameters of the residual block are the same in each network layer, as shown in Fig. 3. Each residual block is composed of three convolutional layers, and each layer is followed by a BN layer and a ReLU layer. In addition, to aggregate high-level and low-level features, three deconvolution layers are applied to the outputs of the three convolution blocks. After that, three feature maps with different levels but the same size are obtained, and these feature maps are fed to the Header Network for classification and regression.
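The simplified PointNet and the pseudo-image construction can be sketched as below. This is a hedged PyTorch illustration: the 9-D input and 128-D output follow the text, while everything else (max-pooling over the N points, the scatter helper and its names) is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified PointNet from the text: Linear -> BN -> ReLU, then a
    pooling step over the N points of each pillar to one 128-D feature.
    Max-pooling is assumed here; the paper does not spell out the pooling."""
    def __init__(self, in_dim=9, out_dim=128):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.relu = nn.ReLU()

    def forward(self, pillars):
        # pillars: (P, N, 9) -- P non-empty pillars, N points each,
        # every point being the 9-D vector p_i from the text.
        P, N, D = pillars.shape
        x = self.linear(pillars.reshape(P * N, D))
        x = self.relu(self.bn(x)).reshape(P, N, -1)
        return x.max(dim=1).values        # (P, 128) pillar-wise features

def scatter_to_pseudo_image(features, coords, L1, W1):
    """Place the 128-D pillar features back on the L1 x W1 grid; empty
    pillars stay zero, yielding the L1 x W1 x 128 pseudo-image."""
    canvas = torch.zeros(L1, W1, features.shape[1])
    canvas[coords[:, 0], coords[:, 1]] = features
    return canvas
```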
Header Network. The aim of the Header Network is to detect objects from the feature maps generated by the Backbone Network. All the 2D feature maps are concatenated to form a stronger semantic feature map for the classification and regression tasks.
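The pyramid-plus-deconvolution flow feeding the Header Network can be sketched as follows. This is only an illustration of the wiring: the three strided stages stand in for the paper's convolution-plus-residual-block layers (the residual blocks themselves are omitted for brevity), and all channel counts, strides and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class BackboneWithHeaderFusion(nn.Module):
    """Sketch of the pyramidal 2D backbone: three strided convolution
    stages, three deconvolution layers that bring every stage back to a
    common spatial size, and concatenation into one semantic feature map
    for the Header Network. All hyperparameters here are illustrative."""
    def __init__(self, c_in=128):
        super().__init__()
        self.stage1 = nn.Conv2d(c_in, 64, 3, stride=2, padding=1)
        self.stage2 = nn.Conv2d(64, 128, 3, stride=2, padding=1)
        self.stage3 = nn.Conv2d(128, 256, 3, stride=2, padding=1)
        # Deconvolutions: map every stage to the stride-2 resolution.
        self.up1 = nn.ConvTranspose2d(64, 128, 1, stride=1)
        self.up2 = nn.ConvTranspose2d(128, 128, 2, stride=2)
        self.up3 = nn.ConvTranspose2d(256, 128, 4, stride=4)

    def forward(self, x):                      # x: (B, 128, L1, W1)
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        # Same spatial size, different semantic levels -> concatenate
        # into the fused map consumed by the Header Network.
        return torch.cat([self.up1(f1), self.up2(f2), self.up3(f3)], dim=1)
```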

Experiment and Results
All experiments are performed on the challenging KITTI benchmark [13], which contains 7481 training point clouds and corresponding RGB images, covering three categories: Car, Cyclist and Pedestrian. Because the labels of the testing dataset are not available, the training dataset is divided into a training set (3712 samples) and a validation set (3769 samples). The loss function and the network parameters are also described in this section.
Loss Function. The loss of the network consists of two parts: class classification and 3D bounding box regression. As shown in Eq. 1, different weights are used to balance their relative importance, where ρ = 1 and μ = 2:

L_total = ρ · L_cls + μ · L_loc    (1)
Classes Classification. For the object classification loss, the focal loss proposed by Lin et al. [23] is used. 3D Bounding Box Regression. For the 3D bounding box regression task, we use the same loss functions defined in PointPillars [22]. Anchors and ground truths are represented by 7 parameters, i.e. central coordinates, dimensions, and the rotation angle around the Z axis. Consequently, the regression targets are defined as a vector (Δx, Δy, Δz, Δl, Δw, Δh, Δθ).
(Figure: 3D boxes represent ground truths, while teal 3D boxes indicate objects that have been successfully detected.)
Cyclist and Pedestrian Detection. Compared with the car detection task, the accuracy of cyclist and pedestrian detection is slightly worse. There are two main reasons. First, network training is insufficient due to the lack of cyclists and pedestrians: the training set contains 10520 cars but only 594 cyclists and 2104 pedestrians. Second, cyclists and pedestrians are small and densely distributed, which makes detection difficult; this is more evident in the pedestrian detection task. However, the accuracy of our method is much higher than that of other algorithms; in particular, for pedestrians, the 3D detection performance of our method is 3.74% higher than that of PointPillars [22] at the Hard level. In addition, the results of the ablation experiment show that a shallower network is more suitable for pedestrian detection.
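The two loss components can be sketched as follows. The focal loss follows the published form of Lin et al. [23], FL(p_t) = −α_t (1 − p_t)^γ log(p_t); the target encoding follows the PointPillars [22] convention as I understand it (anchor diagonal d_a = sqrt(l_a² + w_a²) for normalizing the planar offsets), so treat the exact normalizers as assumptions of the sketch rather than the paper's definition.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss of Lin et al. [23]: FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t),
    with p the predicted foreground probability and y in {0, 1}."""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

def encode_box_targets(gt, anchor):
    """Regression targets (dx, dy, dz, dl, dw, dh, dtheta) in the
    PointPillars-style encoding; boxes are (x, y, z, l, w, h, theta)."""
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    da = np.sqrt(la ** 2 + wa ** 2)      # anchor box diagonal
    return np.array([(xg - xa) / da,     # planar offsets, diagonal-normalized
                     (yg - ya) / da,
                     (zg - za) / ha,     # vertical offset, height-normalized
                     np.log(lg / la),    # log-ratio size residuals
                     np.log(wg / wa),
                     np.log(hg / ha),
                     tg - ta])           # angle residual (sin(.) is applied
                                         # to it inside the localization loss)
```

A well-classified positive (high p) contributes far less focal loss than an uncertain one, which is what lets training concentrate on the rare cyclist and pedestrian examples.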

Summary
To solve the problem of degradation caused by network deepening, a novel Backbone Network consisting of several residual modules is proposed in this paper. The results of the ablation experiments on the challenging KITTI validation set show that the performance of our network is better than that of PointPillars [22] at the same network depth. In addition, the comparison with other methods also demonstrates that the proposed Backbone Network is well suited to 3D object detection.