Combined Auxiliary Networks and Bird’s Eye View Method for Real-Time Multicategory Object Recognition

Object recognition based on LIDAR data is crucial in autonomous driving and is the subject of extensive research. However, the lack of accuracy and stability in complex environments obstructs the practical application of real-time recognition algorithms. In this study, we proposed a new real-time network for multicategory object recognition. Manually extracted bird’s eye view (BEV) features were adopted to replace the resource-consuming 3D convolutional operation. Besides the subject network, we designed two auxiliary networks to help the network learn pointwise and boxwise features, aiming to improve the accuracy of the categories and bounding boxes. The KITTI dataset was adopted to train and validate the proposed network. Experimental results showed that, for hard mode, the total average precision (AP) of the category reached 97.4%. For intersection over union (IoU) thresholds of 0.5 and 0.7, the total AP of regression reached 93.2% and 85.5%, respectively; in particular, the AP of car regression reached 95.7% and 92.2%. The proposed network also showed consistent performance on the Apollo dataset, with a processing duration of 37 ms. The proposed network exhibits stable and robust object recognition performance in complex environments (multiobject, unordered objects, and multicategory). However, it is sensitive to occlusion of the LIDAR system and insensitive to large objects that are close to the sensor. The proposed multifunction method simultaneously achieves real-time operation, high accuracy, and stable performance, indicating its great potential value in practical application.


Introduction
Autonomous driving is a futuristic technology that will transform mobility industries and ease the burden of driving. Autonomous driving is currently supported by relatively mature planning, decision-making, and algorithm implementation but is mainly hindered by its poor perception. As an efficient and precise remote sensing technique, LIDAR systems have been widely applied in real-time intelligent systems, such as self-driving vehicles [1,2]. The data acquired from LIDAR are point clouds, which are sets of points containing coordinates and other feature-related information, such as reflectivity. Detecting objects accurately within a point cloud is crucial and has been a widespread research subject. However, the key challenge is that the raw point cloud data are irregular, unstructured, and unordered. Consequently, processing methods that require data with a regular form are not suitable for direct application. The convolution operation is an efficient approach for extracting deep features [3][4][5][6][7], but it requires a regular grid as the input, which a point cloud does not satisfy. Therefore, the first step is to transform the unstructured point cloud into a regular form. The structured data can be graphs [8][9][10][11][12], ordered points [13][14][15][16][17], or voxels [18][19][20][21][22]. In graph-based methods, nodes represent points, and edges represent the relationships between points. This abstract expression is obscure. Although point-based methods can achieve better performance by taking the raw point clouds as their input and predicting bounding boxes based on each point, in general, their inference time cannot meet the demands of a real-time system. Therefore, they are restricted primarily to offline analysis. Voxels are popular because they have a clear physical structure similar to images. VoxelNet [20] is an example of a classic voxelization method that performs impressively in 3D object recognition tasks.
Its strong performance relies heavily on several 3D convolution operations, resulting in a time- and memory-consuming process. To avoid using 3D convolution operations, a structure that replaces 3D voxels with pillars, thereby erasing the vertical dimension, has been reported [21]. This method, also called the bird's eye view method, improved the processing speed, although the performance was unstable due to the lack of vertical information. Because point clouds also lack color information, this instability is more serious than in image recognition tasks [22]. Alternatively, a compromise approach keeps the necessary information concisely by using the maximum height, the density of the point set, and the reflectivity of the highest point to express the pillar feature [23].
To achieve higher precision of the bounding boxes, RGB images are fused with LIDAR data [24], thus obtaining a richer expression of the environment. The introduction of camera data means that this method depends on the trigger consistency of the two kinds of data [25,26] and on the calibration accuracy between the camera and LIDAR coordinate systems, which may cause robustness problems in practical applications. Inspired by the better performance of point-based methods, an alternative method involves aggregating the voxels into a small number of key points [27], thus combining the advantages of both voxel- and point-based methods. In addition, that study adopted farthest point sampling (FPS) to sample key points. However, FPS is extremely time-consuming, especially for a large-scale scene, and the sampling time is not discussed in [27]. Therefore, finding an optimal balance between performance and processing time is still a challenge.
Most researchers use only a single category of data when training networks and assign independent evaluation indexes for the recognition effect of single categories. This method excludes the interference of other categories in the result. Furthermore, it deviates from the requirement in actual applications that multiple categories be recognized through one forward propagation, so it cannot explain the actual effect of the application.
The present study focused on developing a LIDAR-based 3D object recognition method for road scenes. Considering the significant success of image recognition, we expect to bring advanced image recognition methods to the point cloud recognition task. Hence, the proposed method is a voxel-based recognition method that can simultaneously predict multiple object categories. We evaluated the method based on the 3D localization and bounding box precision, object recognition accuracy, and processing time. Unlike most present methods that rely heavily on 3D convolutional operations, we considered that the bird's eye view (BEV) based method has not yet exhausted its performance potential. Thus, we improved the head network and designed additional auxiliary networks to improve the prediction accuracy. The network was trained and evaluated on the KITTI dataset and its benchmark. The results verify that the new parts are beneficial to the network. The rest of this paper is structured as follows: Section 2 presents the proposed network architecture; Section 3 outlines the implementation of the proposed network and presents the results; Section 4 discusses the specific recognition effects that are not obvious in the evaluation indicators; and Section 5 presents our conclusions.

Methods
The proposed network is divided into preprocessing, backbone network, neck network, head network, and auxiliary networks: (1) The preprocessing stage transforms the unordered point cloud into ordered data. (2) The backbone and neck networks are used to extract scene features. (3) The head network transforms the scene features into predicted outputs. (4) The auxiliary network is set up to help the subject network learn pointwise and boxwise features. It does not participate in the prediction process, so it will not cause an additional computing burden to the network.

Preprocessing.
The preprocessing follows the method outlined in [23]. First, the irregular points are transformed into a pillar map according to their location. Besides the three channels mentioned in [23], we add a channel containing the pillar's minimum height, which expresses the difference between the edge and the inside of an object. Therefore, four channels represent the vertical distribution of the points in each pillar: the first channel records the number of points in the pillar; the second and the third record the maximum and minimum vertical coordinates of the points in the pillar; and the fourth records the reflectivity of the highest point in the pillar. Finally, a four-channel bird's eye view (4C-BEV) is obtained as the network input. This method is essentially equivalent to taking the upper cover shell of the spatial point cloud from a top-down perspective. Because of Earth's gravity, very few objects are suspended in the air, and obstacles can usually be clearly distinguished by direct observation of such shells. The channels' values need to be normalized, especially the first channel, because the point cloud density increases from far to near. Here, a distance factor $K_d$ is added so that pillars at different locations have a similar degree of characteristic expression; in this normalization, $N_p$ represents the number of points in each pillar, and $K_d$ is defined in terms of a coefficient $k_e$. As shown in Figure 1, the objects are visible to the naked eye in the 4C-BEV, indicating that this method can preserve the point cloud's information in the vertical direction while compressing the data efficiently.
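The four channels described above can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_4c_bev`, the argument layout, the row/column axis convention, and the choice to leave empty pillars at zero are all assumptions, and the distance-dependent normalization with $K_d$ is omitted because its formula is not available in this text.

```python
import numpy as np


def build_4c_bev(points, x_range, y_range, cell):
    """Build a four-channel BEV map from an (N, 4) array of x, y, z, reflectivity.

    Channels: 0 = point count, 1 = max z, 2 = min z,
    3 = reflectivity of the highest point. Empty pillars stay at zero.
    (Hypothetical helper; axis convention and zero fill are assumptions.)
    """
    h = int(round((x_range[1] - x_range[0]) / cell))
    w = int(round((y_range[1] - y_range[0]) / cell))
    bev = np.zeros((h * w, 4), dtype=np.float32)

    # Map every point to a pillar and drop points outside the BEV extent.
    rows = np.floor((points[:, 0] - x_range[0]) / cell).astype(np.int64)
    cols = np.floor((points[:, 1] - y_range[0]) / cell).astype(np.int64)
    keep = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    if not np.any(keep):
        return bev.reshape(h, w, 4)
    flat = rows[keep] * w + cols[keep]
    z = points[keep, 2]
    refl = points[keep, 3]

    # Sort by pillar index first and by height second, so each pillar's
    # points are contiguous and ordered from lowest to highest.
    order = np.lexsort((z, flat))
    flat, z, refl = flat[order], z[order], refl[order]
    boundary = flat[1:] != flat[:-1]
    first = np.r_[True, boundary]   # lowest point of each pillar
    last = np.r_[boundary, True]    # highest point of each pillar

    bev[:, 0] = np.bincount(flat, minlength=h * w)
    bev[flat[last], 1] = z[last]
    bev[flat[first], 2] = z[first]
    bev[flat[last], 3] = refl[last]
    return bev.reshape(h, w, 4)
```

Sorting once and reading the group boundaries avoids repeated writes to the same pillar, whose outcome NumPy does not guarantee for fancy-index assignment.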

Backbone Network.
The 4C-BEV is entirely consistent with an image in terms of data structure. Therefore, many popular backbone networks for image recognition can be used directly, such as ResNets [28], CSPDarknet53 [3], or VGG16 [29]. However, there are some differences between the object recognition tasks in images and point clouds. First, multiple scales are not necessary. In the image recognition task, the perspective phenomenon is one of the main factors requiring consideration in the network design. Therefore, the network contains output nodes representing different scales, or several prior boxes are predefined to represent different scales. When the feature map is constructed using the LIDAR coordinates, objects keep their real-world size, so this perspective phenomenon is not encountered. Second, the scales of objects are different. In the image recognition task, in general, the recognition performance varies between large- and small-scale objects when using the same network. Typically, the area of interest appears near the observer, which means the identification accuracy of large targets is more important than that of others. The image passes through a multilayer network, which significantly reduces its scale and improves the recognition ability regarding large-scale objects. Taking CSPDarknet53 [3] as an example, after an input image is transmitted forward, the scales of the three outputs are reduced by 8, 16, and 32 times, respectively. Using this encoding approach, the feature vector at each position can fuse the features of a broader receptive field, which helps identify large-scale objects. However, for LIDAR-based tasks, objects are relatively small compared to the scene size, so an output feature map with a large downsampling scale degrades the recognition accuracy.
Third, the orientation of the bounding boxes needs to be predicted. The maximum pooling layers play an essential role in a backbone network because they can prevent overfitting and improve the network's generalization ability; however, they also enhance rotation invariance, which is undesirable when the orientation must be predicted. The backbone network architecture is similar to a tiny version of CSPDarknet53. Figure 2 illustrates the modified architecture. We use Conv($k$, $s$, $p$, $c_{out}$) to represent a 2D convolutional operator, where $c_{out}$ is the number of output channels, and $k$, $s$, and $p$ are the kernel size, stride, and padding size, respectively. When it acts as a convolutional middle layer, the "Conv" operation contains a 2D convolutional operator, a group normalization (GN) layer, and an activation function layer in sequence. We used several small residual blocks to fuse the features of the current layer and the previous layer. Then, we used big residual blocks to fuse shallow features and deep features. The nodes of the backbone network measure $h \times w \times c$, where $h$ and $w$ are the spatial dimensions, and $c$ is the channel dimension. The input is a feature map with a fixed size of $h \times w \times 3$. The backbone network has two outputs: one with a fixed size of $h/2 \times w/2 \times 128$, and the other with a fixed size of $h/4 \times w/4 \times 512$.
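The spatial sizes of the two backbone outputs can be checked with the standard convolution output-size formula, $\lfloor (n + 2p - k)/s \rfloor + 1$. The sketch below assumes the 208 × 208 input grid given in the implementation details, and it uses a hypothetical 3 × 3 kernel with padding 1; the actual layer settings are those of Figure 2, which is not reproduced here.

```python
def conv_out_size(n, k, s, p):
    """Output length along one axis of a 2D convolution Conv(k, s, p, c_out)."""
    return (n + 2 * p - k) // s + 1


h = 208  # assumed input size, taken from the BEV grid in the Details section

# Hypothetical layer settings: 3x3 kernel, padding 1.
same = conv_out_size(h, k=3, s=1, p=1)         # stride 1 keeps the size
half = conv_out_size(h, k=3, s=2, p=1)         # one stride-2 layer gives h/2
quarter = conv_out_size(half, k=3, s=2, p=1)   # a second one gives h/4
print(same, half, quarter)
```

With these assumed settings, two stride-2 layers produce the $h/2$ and $h/4$ resolutions of the two backbone outputs.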

Neck Network.
The role of the neck network is to perform further feature extraction and connect the backbone to the head. Figure 3 shows the architecture of our neck network.
The residual blocks are retained to aid further feature extraction. Upsampling operations are used to unify the scale of the feature map. Although there is no perspective phenomenon, different categories of objects have different sizes, and multiscale features play a positive role in the network.

Head Network.
The head network is custom-designed for our specific 3D object recognition task and divided into three parts. The first part is used for confidence prediction, with the sigmoid function used to limit the result range to [0, 1]. Two channels are assigned to each category, representing the regression confidence of this category based on horizontal and vertical anchors. The second part is used for predicting bounding boxes. The spatial position and physical dimensions are predicted in this part. As there is no perspective effect, it is reasonable to regress the bounding boxes relative to a standard reference value. Therefore, we predefined an anchor map as the standard reference value, in which each position has $2 \times N_c$ anchors, where $N_c$ is the number of predicted categories. In general, the orientation and border predictions are conducted simultaneously [20,21,23]. This approach cannot express the closeness of the two ends of the angle interval; it inevitably treats them as maximally divergent, which is incorrect. To keep the prediction of orientation continuous, we adopt a combined anchor-free [30] and anchor-based [20,21,23] method. Six channels are assigned to represent the regression parameters (except for orientation) of the two anchors at each position. Furthermore, the sine and cosine values are used to represent the orientation indirectly.
In most studies, there is little discussion on multicategory object prediction. By default, when predicting multicategory objects, the regression parameters for all categories are given, which leads to low information utilization (only $1/N_c$ of the information is useful). Thus, the convergence efficiency is greatly affected. In this study, the network is designed to give only one set of border predictions at each position. The category of the box center is determined according to the ground truth. The categories of the other positions are determined by the overlap between the standard anchor and the ground truth bounding box.
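The benefit of representing the orientation by its sine and cosine can be illustrated with a small example. Two headings just on either side of the $\pm\pi$ boundary are almost identical physically, yet their raw angle values differ by nearly $2\pi$; their (sin, cos) encodings stay close. The function names below are illustrative only.

```python
import math


def encode_yaw(theta):
    """Represent a heading angle by the pair (sin, cos)."""
    return math.sin(theta), math.cos(theta)


def decode_yaw(s, c):
    """Recover the heading in (-pi, pi] from a (sin, cos) pair."""
    return math.atan2(s, c)


# Two nearly identical headings on opposite sides of the +/-pi boundary.
a = math.pi - 0.01
b = -math.pi + 0.01

raw_gap = abs(a - b)  # close to 2*pi although the headings almost coincide
sa, ca = encode_yaw(a)
sb, cb = encode_yaw(b)
encoded_gap = math.hypot(sa - sb, ca - cb)  # small, as expected

print(raw_gap, encoded_gap)
```

Because `atan2` only depends on the ratio and signs of its arguments, a predicted pair does not need to lie exactly on the unit circle to decode to a valid angle.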

Mathematical Problems in Engineering 3
For convenience, it is assumed that the category is determined, and there are two anchors for each position. The ground truth of the bounding box regression value $R_{gt}$ of one anchor at each location can be expressed as follows:

$$R_{gt} = [\Delta x, \Delta y, \Delta z, \Delta h, \Delta w, \Delta l, \sin\theta_{gt}, \cos\theta_{gt}]^T, \quad A = [x_a, y_a, z_a, h_a, w_a, l_a, \theta_a]^T,$$

where $A$ denotes the parameters of one anchor in each position. The third part is used for category prediction, for which $N_c$ channels are assigned. The softmax function is used to transform the result into $N_c$ probabilities, whose range is limited to [0, 1].

Auxiliary Network.
Because of the ability to obtain more detailed pointwise characteristics, point-based methods usually achieve higher accuracy than voxel-based methods. To enhance our method's accuracy, the pointwise feature was introduced to the network. Following SA-SSD [31], the pointwise feature learning network was set as an auxiliary network that only works during training and plays no role in prediction, avoiding the computational overhead of the additional feature extraction. The penultimate layer of the neck network serves as the preceding feature extraction layer of the auxiliary network, which is ultimately a voxelwise category prediction network. The auxiliary network is elaborated in Figure 4. The accuracy of border regression is highly dependent on the accuracy of category prediction. Therefore, the primary task is to improve the accuracy of object category prediction by using more of the category information of the point cloud. Unlike the category prediction part in the head network, which only focuses on the category prediction of the bounding boxes' center voxels, the auxiliary network focuses on the category prediction of the voxels around the bounding box center. Since each voxel contains only one highest point, voxel features are equivalent to pointwise features.
We randomly extract no more than 1000 points inside the bounding boxes and no more than 250 points outside them to save memory space. We recreate the voxel category label depending on whether its highest point is within the bounding box. The whole operation is similar to an additional "dropout" process, which improves the generalization performance of the network. The second task of the auxiliary network is to enhance the accuracy of bounding box regression. In this step, we randomly sample no more than 50 highest points within each bounding box and calculate the inverse distance of each to the bounding box center as its weight. The weighted average and the maximum of the pointwise features among all points in the bounding box region are combined to express a boxwise feature.

Loss.
The loss contains the central part and the auxiliary part. The central part contains the confidence loss, regression loss, and category loss. The auxiliary part contains the point category loss and box regression loss. We adopted the smooth L1 function [5] to calculate the bounding box regression loss:

$$\mathrm{SMOOTH}_{L1}(X_p, X_{gt}, a) = \sum_n \mathrm{smooth}_{L1}(X_p - X_{gt}, a)_n.$$

The smooth L1 function converges stably when the deviation is large and adequately when the variation is small. The predictions of category and confidence are converted into probability values within the interval [0, 1], and the cross-entropy function is used to calculate their losses, where $X_p$ and $X_{gt}$ are the predicted value and ground truth value, respectively. The ground truth values of the category are labeled in one-hot form. The focal loss [32] addresses the problem that, when the proportion of positive and negative samples is unbalanced, the minority samples are submerged by the majority ones. Although the distributions of positive and negative labeled data are extremely uneven, the ratio of positive and negative samples is given. To avoid the focal loss affecting the rate of convergence, we do not adopt it.
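The summed smooth L1 loss above can be sketched as follows. The elementwise definition of $\mathrm{smooth}_{L1}(\cdot, a)$ is not given in this text, so the sketch assumes one common parameterization in which $a$ is the transition point between the quadratic and linear regimes; the authors' exact form may differ.

```python
import numpy as np


def smooth_l1(x, a=1.0):
    """Elementwise smooth L1: quadratic for |x| < a, linear beyond.

    Assumed parameterization; the role of `a` in the source is not spelled out.
    """
    x = np.abs(np.asarray(x, dtype=np.float64))
    return np.where(x < a, 0.5 * x ** 2 / a, x - 0.5 * a)


def smooth_l1_sum(x_p, x_gt, a=1.0):
    """Sum of elementwise smooth L1 terms between prediction and ground truth."""
    diff = np.asarray(x_p, dtype=np.float64) - np.asarray(x_gt, dtype=np.float64)
    return float(np.sum(smooth_l1(diff, a)))


print(smooth_l1_sum([0.5, 3.0], [0.0, 0.0]))
```

Under this parameterization the two branches meet at $|x| = a$ with the same value and slope, which is what gives the function its stable gradients for large deviations and smooth behavior near zero.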
Not all losses in each position are calculated in a feature map. Predictions at grids that are far from the center of the object are inaccurate and can be neglected. The positive confidence label is vital because it can be used as a mask to filter out untrusted data so that they are not included in the loss calculation. In Section 2.4, an anchor map was established. We excluded the angle parameters and determined the confidence by calculating the intersection over union (IoU) between the ground truth bounding box and the anchors in the map. Because the confidence feature map does not rely on the vertical position, the projections of the ground truth bounding box and the anchors along the vertical direction are used when calculating the IoU.

[Figure caption, partially recovered] The red point is the center of the ground truth bounding box. The boxwise feature is then followed by two fully connected layers (customarily called "Dense"), generating bounding box regression values similar to $R_{gt}$ (see equation (4)).

The final loss $L$ is composed of the following terms. In the confidence loss, $P_p$ represents the positive confidence prediction, $P_{gt}$ denotes the corresponding positive ground truth, and $N_{gt}$ denotes the corresponding negative ground truth. In the regression loss, $R_p$ is the bounding box regression prediction and $R_{gt}$ is the corresponding ground truth. In the category loss, $M_{gt}$ is the maximum last channel value of $P_{gt}$. In the auxiliary parts of the loss, $W_b$ denotes the weights calculated by equation (12); $B_p$ and $C^p_p$ denote the boxwise regression feature and category prediction of the sample point; and $B_{gt}$ and $C^p_{gt}$ denote the corresponding ground truths.
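The confidence labelling described in this section compares ground truth boxes with anchors in the horizontal plane, with the angle parameters excluded. A minimal sketch of such a BEV IoU is given below; it assumes both boxes are treated as axis-aligned rectangles described by a center, a length along x, and a width along y, which is an interpretation of "excluded the angle parameters" rather than a confirmed detail.

```python
def bev_iou(box_a, box_b):
    """IoU of two axis-aligned BEV rectangles.

    Each box is (x_center, y_center, length_x, width_y); heights and
    vertical positions are ignored, as are orientations (assumption).
    """
    ax0, ax1 = box_a[0] - box_a[2] / 2, box_a[0] + box_a[2] / 2
    ay0, ay1 = box_a[1] - box_a[3] / 2, box_a[1] + box_a[3] / 2
    bx0, bx1 = box_b[0] - box_b[2] / 2, box_b[0] + box_b[2] / 2
    by0, by1 = box_b[1] - box_b[3] / 2, box_b[1] + box_b[3] / 2

    # Overlap along each axis, clamped at zero for disjoint boxes.
    inter_x = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_y = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_x * inter_y

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

A positive confidence label could then be assigned wherever this value exceeds a chosen threshold; the threshold used by the authors is not stated in this text.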

Dataset.
Most 3D object recognition networks are trained using the KITTI dataset [33]. The KITTI dataset contains 7481 frames, among which we selected 2000 frames as the verification set and the remaining 5481 frames as the training set. We were interested in cars, trucks, vans, pedestrians, and cyclists among the object categories. In addition, trucks and vans were merged into one class. In this study, the Apollo dataset [34] was also adopted.
In contrast to the KITTI dataset, the Apollo dataset contains continuous frame data. When a vehicle turns, the surrounding objects show disordered orientations and spatial positions, which are more complicated than those in the nonturning state.
Such representative disordered data frequently occur in continuous frame data. The Apollo dataset contains 16 scenes. Each scene contains 2-5 sections of continuous frame data collected at a frequency of 2 Hz, each lasting for 1 min. We took one section of each scene as the verification set and the rest as the training set. The final training set consisted of 3,943 frames, while the validation set consisted of 1,650 frames. We were interested in four types of labeled data: small vehicles, big vehicles, pedestrians, and riders (i.e., motorcyclists and bicyclists), which were labeled using the abbreviations "VEH, TRU, PED, CYC," respectively, during the visualization step. The data augmentation technique [34] was adopted during the training process.

Details.
In this study, points within 41.6 m in front of, 20.8 m to the left and right of, and 2 m above and below the origin of the LIDAR coordinate system were used to construct the BEV feature map. The resolution of the grid was set as 0.2 m. Therefore, the BEV feature map was divided into 208 × 208 grids.
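The grid size follows directly from the stated ranges and resolution, as the short check below illustrates. The helper name and the convention that x points forward from the sensor and y points sideways are assumptions made for the example.

```python
def grid_shape(x_range, y_range, cell):
    """Number of BEV cells along the forward (x) and lateral (y) axes."""
    rows = round((x_range[1] - x_range[0]) / cell)
    cols = round((y_range[1] - y_range[0]) / cell)
    return rows, cols


# Ranges from the text: 41.6 m ahead, 20.8 m to each side, 0.2 m cells.
rows, cols = grid_shape((0.0, 41.6), (-20.8, 20.8), 0.2)
pillars = rows * cols  # total number of pillars in the BEV map
print(rows, cols, pillars)
```

Both extents span 41.6 m, so each axis is divided into 41.6 / 0.2 = 208 cells.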
Minibatch gradient descent was conducted with a batch size of 1. Because of the small batch size, we replaced all batch normalization layers in the network with group normalization layers. One full pass over the training set was defined as an epoch. The number of epochs was set as 100; the first 75 epochs had a learning rate of 0.001, while the remaining epochs had a learning rate of 0.0001. All algorithms were run on a workstation with a Core i7 CPU, 8 GB of RAM, and an NVIDIA 1080Ti GPU, using the open-source deep learning framework TensorFlow. Nonmaximum suppression was deployed to filter out excess bounding boxes, with the IoU threshold set as 0.1.
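The learning rate schedule described above is piecewise constant and can be written as a small function. The function name and the convention of counting epochs from zero are assumptions for illustration.

```python
def learning_rate(epoch, total_epochs=100, switch_epoch=75):
    """Piecewise-constant schedule: 1e-3 for the first 75 epochs, then 1e-4.

    `epoch` is counted from 0 (assumed convention).
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs)")
    return 1e-3 if epoch < switch_epoch else 1e-4


schedule = [learning_rate(e) for e in range(100)]
```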

Evaluation Indicators.
It is assumed that $n_p$ objects are predicted with $n_{gt}$ objects labeled in the ground truth. First, the predicted objects need to be paired with the ground truth objects by calculating the IoU. Considering the predicted objects as the benchmark and matching all predicted objects with the labeled objects using the maximum IoU, the calculated result is denoted as the precision. Similarly, considering the labeled objects as the benchmark and matching all labeled objects with the predicted objects using the maximum IoU, the calculated result is denoted as the recall. As the recall rises, the precision drops. Using recall as the horizontal axis and precision as the vertical axis, the area enclosed by the plotted recall-precision curve and the coordinate axes is the average precision (AP), which is widely adopted to evaluate the performance of the network. We set the parameters consistent with the KITTI benchmark, where the IoU threshold is 0.7 for cars and 0.5 for pedestrians and cyclists.
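The AP defined above as the area under the recall-precision curve can be approximated numerically. The sketch below uses plain trapezoidal integration over sampled (recall, precision) pairs; benchmark toolkits typically use interpolated precision at fixed recall positions instead, so this is an illustration of the definition, not of any official evaluation script.

```python
def average_precision(recall, precision):
    """Area under a recall-precision curve by the trapezoidal rule.

    `recall` must be sorted in ascending order and have the same
    length as `precision`.
    """
    if len(recall) != len(precision):
        raise ValueError("recall and precision must have the same length")
    area = 0.0
    for i in range(1, len(recall)):
        width = recall[i] - recall[i - 1]
        area += width * (precision[i] + precision[i - 1]) / 2.0
    return area


# Toy curve: precision stays at 1.0 up to recall 0.5, then falls to 0.5.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
```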

Loss Curves.
Validation was performed using one batch of data in the validation set per 100 iterations. During the training process, the loss fluctuated violently, so the curves were smoothed with a decay weight of 0.99 to represent the changing trend of the loss. The evolution of the loss curves throughout the training process is shown in Figure 5. By the time the iteration count reaches 500,000, the network has converged. Our trained model is marked with red points in the figures.
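If the 0.99 weight is read as the decay of an exponential moving average, which is an interpretation rather than something the text states explicitly, the smoothing of a noisy loss curve can be sketched as follows.

```python
def smooth_losses(losses, decay=0.99):
    """Exponential moving average of a loss sequence.

    The first value initializes the average; each later value contributes
    with weight (1 - decay). The EMA reading of `decay` is an assumption.
    """
    smoothed = []
    average = None
    for value in losses:
        if average is None:
            average = value
        else:
            average = decay * average + (1.0 - decay) * value
        smoothed.append(average)
    return smoothed
```

With a decay of 0.99, each smoothed point mostly reflects the recent history of the loss, so isolated spikes are damped while the overall trend remains visible.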

Speed.
The acquisition frequencies commonly used by LIDAR are 5, 10, 15, and 20 Hz. The entire recognition process is divided into preprocessing and inference. The mean durations of the preprocessing and inference processes were approximately 5.7 and 31.2 ms, respectively. The mean duration of the whole recognition process was therefore approximately 37 ms, which is shorter than the 50 ms frame period of a 20 Hz LIDAR and thus meets the real-time requirement.

Accuracy.
The recall-precision curves (for the network trained on the KITTI dataset) are given in Figure 6. The APs of the regression and category (for the network trained on the KITTI dataset) are listed in Table 1. The data we mainly focus on are marked in bold.
The AP for cars was 92.5, which is relatively high, considering that the regression is based on the anchor determined by the category prediction. Trucks, which are not considered in the KITTI benchmark, are included in the identification. Due to the uneven distribution of training data, the AP of the other categories is slightly lower. The total AP (0.5), the total AP (0.7), and the total AP (categories) reach 93.2, 85.5, and 97.4, respectively. In the Apollo dataset, the labeled bounding boxes cover only the objects' visible parts, which differ from the actual physical size, resulting in low indicators.
For clearly visible objects, the network shows similar performance on the Apollo dataset to that on the KITTI dataset, as described in Section 3.5.

Comparison.
The main contribution of this paper lies in the combined anchor-free and anchor-based prediction method and the auxiliary networks specially designed for the bird's eye view network. The comparison is shown in Table 2. Due to the use of the data augmentation technique, the training results can differ slightly between runs. Indicators within ±1% were regarded as the same performance. Our proposed method significantly improves the prediction results of pedestrians and cyclists, and the data we mainly focus on are marked in bold.

Scene Analysis.
The method described in this paper was designed for a complex environment (multiobject, unordered objects, multicategory). In this section, some typical scenes are selected to analyze the network's performance concerning object recognition. The continuous frame data from the Apollo dataset provide sufficient verification of the stability of the recognition effect. Figure 7 shows three frames of a congested traffic scene. This scene features many vehicles in a dense array, precisely what the proposed network has been designed to identify. A rotation scene is shown in Figure 8. Most objects are recognized accurately, and the performance remains stable. Despite the large rotation in this scene, the direction of the objects was also predicted accurately. Figure 9 depicts a scene containing several objects belonging to different categories, including a vehicle, truck, and cyclist. Each object was recognized consistently and accurately across the consecutive frames. The recognition visualization of the KITTI dataset is shown in Figure 10, for which a series of typical complex scenarios were selected, including multiobject, unordered objects, and multicategory.

Discussion.
The orientation of the bounding box is expressed indirectly by sine and cosine values in this paper. Compared with direct prediction, the result will not deviate wildly when the predicted value is inaccurate. This is the advantage of continuous prediction, which makes the network robust. However, because of the additional prediction dependence in the calculation, the accuracy will be worse if only high-quality prediction results are compared. Similarly, in the proposed multicategory object recognition network, the bounding box regression is highly dependent on the category prediction, which prevents the regression from achieving the highest precision but makes the prediction more robust. Our approach is not very sensitive to large objects that are very close to the observer. This is the reason that the AP of cars for easy mode was slightly worse than that for the moderate mode. Expanding the receptive field can alleviate such problems, but it will increase the depth of the network. Our experimental results showed that deeper networks increase the inference time but have no significant effect on accuracy, which is very different from image recognition tasks. The indicators in this paper are much higher than those on the KITTI ranking list. The main reason is that the scope of perception selected in this paper is smaller than the standard. Our range covers 41.6 m (70.4 m in standard) in front, 20.8 m (40 m in standard) left and right, and 2 m above and below the LIDAR coordinate system, which meets the demands of our application.

Conclusions
The main aim of this study was to design a LIDAR-based object recognition method for autonomous vehicle systems. Thus, we proposed a new multifunctional network that operates in real time with high accuracy and stable performance. As several recognition methods show considerable performance differences across datasets, the Apollo dataset was adopted in addition to the KITTI dataset in this study, making the validation results more consistent with actual application scenes. Hence, the proposed recognition method has a high practical value. The key findings of this study are outlined below: (1) The proposed network realizes the accurate recognition of multiple types of objects in real time. (2) To tackle inaccurate category prediction, an auxiliary network was designed to help the network learn pointwise features. It is not limited to the category of the object's center point, making the prediction result more robust. (3) To tackle inaccurate bounding box prediction, the validity of the indirect expression of the orientation angle by sine and cosine values was first verified. In addition, another auxiliary network was designed to help the network learn boxwise features. (4) The proposed network delivers a stable and robust object recognition performance in complex environments (multiobject, unordered objects, and multicategory), reflecting its high practical value. (5) The proposed network's performance is impacted negatively when the LIDAR system is obscured, and the network is not sensitive to large objects that are very close to the observer. Further research is necessary to address these weaknesses of the network.
In this study, we have considered several possible problems in practical application scenarios. Although our proposed method needs to be further improved, it has demonstrated a very high practical application potential. Whereas most current methods rely heavily on 3D sparse convolutional operations [35], the stable performance of our method shows that manually extracted bird's eye view features can serve the same purpose as three-dimensional convolution.

Data Availability
The data used to support the findings of this study are available upon request to the corresponding author.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.