A Lightweight Model for 3D Point Cloud Object Detection

Abstract: With the rapid development of deep learning, increasingly complex models are applied to 3D point cloud object detection to improve accuracy. In general, the more complex the model, the better its performance but the greater its computational resource consumption. Complex models are therefore unsuitable for deployment on edge devices with restricted memory, so accurate and efficient 3D point cloud object detection is necessary. Recently, lightweight model design has been proposed as an effective form of model compression that aims at more efficient network computation. In this paper, a lightweight 3D point cloud object detection network architecture is proposed. Its core innovations are a lightweight 3D sparse convolution layer module (LW-Sconv module) and a knowledge distillation loss. Firstly, in the LW-Sconv module, factorized convolution and group convolution are applied to the standard 3D sparse convolution layer. As the basic component of the proposed lightweight 3D point cloud object detection network, the LW-Sconv module greatly reduces network complexity. Then, the knowledge distillation loss is used to guide the training of the lightweight network and further improve detection accuracy. Finally, extensive experiments verify the proposed algorithm. Compared with the baseline model, the proposed model reduces FLOPs and parameters by 3.7 times and 7.9 times, respectively. Trained with the knowledge distillation loss, the lightweight model achieves accuracy comparable to the baseline. Experiments show that the proposed method greatly reduces model complexity while maintaining detection accuracy.


Introduction
Recently, autonomous driving has attracted increasing attention. As an important part of the self-driving vehicle perception system, 3D object detection predicts the location, size, and category of key 3D objects near the vehicle and provides accurate environmental information. Light detection and ranging (LiDAR)-based methods [1] occupy a major position in the field of 3D object detection. However, as the precision of scanning equipment continues to increase, the scale of raw point cloud data becomes huge. Advanced 3D detectors are often accompanied by complex network structures that require billions of floating point operations (FLOPs) and cannot be deployed on computationally limited platforms. Many model compression techniques have been proposed in computer vision to address this problem, such as network pruning [2][3][4], quantization [5][6][7], lightweight model design [8][9][10], and knowledge distillation [11][12][13]. However, there are few studies on model compression in the 3D field. This paper aims to explore an efficient 3D convolutional neural network architecture that meets the computing budget of edge devices.
It is well known that the convolution layer is the main 'time killer' in neural networks. Among the model compression techniques above, lightweight model design redesigns the computation of the convolution layer, aiming to reduce model complexity and achieve a more efficient network structure. We note the advanced lightweight models in the two-dimensional field. For example, Refs. [14][15][16] apply grouped point-wise convolution to the input data, reducing the data dimension and the number of convolution-layer parameters; Refs. [17][18][19] use factorized convolution to design lightweight models, which effectively extract features while reducing computational complexity. Based on these two ideas, this paper designs a novel 3D sparse convolution layer module (LW-Sconv module) as the basic module of an efficient 3D object detection network. The core of the module consists of point-wise 3D convolution and depth-wise 3D sparse convolution. Transpose and reshape operations are also introduced to help information flow between channels.
In this paper, an effective lightweight 3D point cloud object detection algorithm is proposed. The framework of the algorithm is built from a novel 3D sparse convolution layer module. In addition, the algorithm is supervised and trained with a three-part knowledge distillation loss, which effectively improves its detection accuracy. Experimental results on two public datasets show that the required operations and parameters are significantly reduced while accuracy is maintained. The main contributions of this paper are summarized as follows: (1) A novel 3D sparse convolution layer module. To enable deployment on memory-limited devices, we design an LW-Sconv module using factorized convolution and group convolution to replace the standard 3D sparse convolution layers, with the aim of reducing model complexity. (2) An effective lightweight 3D object detector. Through joint learning, the detector is trained with three-part knowledge transfer, i.e., relation transfer, feature transfer, and output transfer, which is effective in obtaining a detector with the best detection performance.
The rest of this paper is organized as follows: The related work is reviewed in Section 2. Then, the detailed framework of the algorithm is described in Section 3 and the algorithm is evaluated by experiments in Section 4. Section 5 ends with a summary and conclusion.

Lightweight Model Design
The amount of computation and the number of parameters of CNN-based 3D point cloud object detection algorithms are mainly determined by the convolution layers and the fully connected layers. Therefore, most existing model acceleration methods focus on reducing the computational complexity of the convolution process. These methods redesign a network structure with lower computational overhead and memory consumption. Ref. [20] proposes a new network structure that increases the nonlinearity of the network and reduces model complexity by adding an additional layer of 1 × 1 convolution; to reduce the storage requirements of the CNN model, it also removes the fully connected layer and uses global average pooling. Group convolution [21] is another commonly used strategy to reduce network computation. GoogLeNet [14] uses a large number of group convolutions, and the effectiveness of group convolution is verified by combining different convolution kernels. Ref. [16] proposes SqueezeNet, which, using a large number of 1 × 1 convolutions together with group convolution, achieves about 50 times compression of the parameters of AlexNet [22] without reducing accuracy. ResNeXt [19] uses group convolution and outperforms ResNet [23] at the same order of magnitude of parameters and FLOPs. MobileNet [17] proposes depthwise separable convolution, which replaces the standard convolution with a depth-wise convolution layer and a point-wise convolution layer; MobileNet can be faster than the VGG16 [24] network. ShuffleNet [15] introduces the concept of channel shuffle: by cross-mixing the channels between different group convolutions, the feature information learned by each group convolution can be exchanged. In the last two years, Ref. [25] has used dropout to design a lightweight CNN network with fewer parameters, and Ref. [26] designed a lightweight model based on a novel convolutional block that prevents overfitting; the block takes advantage of separable convolution and squeeze-expand operations.
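As a rough illustration of the savings behind depthwise separable convolution discussed above, the parameter counts of the two designs can be compared directly (a sketch; the channel and kernel sizes are arbitrary example values, not taken from the cited papers):

```python
# Parameter count of a standard 3D conv vs. a depthwise separable 3D conv
# (bias terms omitted for simplicity).
def standard_conv_params(c_in, c_out, k):
    # one k x k x k filter spanning all input channels, per output channel
    return c_in * c_out * k ** 3

def depthwise_separable_params(c_in, c_out, k):
    depthwise = c_in * k ** 3   # one k x k x k filter per input channel
    pointwise = c_in * c_out    # 1 x 1 x 1 filters recombine channels
    return depthwise + pointwise

std = standard_conv_params(64, 128, 3)        # 221184 parameters
sep = depthwise_separable_params(64, 128, 3)  # 1728 + 8192 = 9920 parameters
print(std, sep, round(std / sep, 1))
```

With these example sizes, the separable factorization uses roughly 22 times fewer parameters than the standard convolution, which is the effect the lightweight designs above exploit.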

Three-Dimensional Object Detection
The development of deep learning has promoted the research of 3D point cloud object detection and the construction of deeper and larger convolutional neural networks has become the mainstream trend. Representative methods are: PointNet [27], which uses a multilayer perceptron to learn the spatial features of points and maximum pooling to aggregate global features; VoxelNet [28], which divides the point cloud into multiple voxels, characterizes the point cloud data by voxel feature encoding, and extracts features by convolutional middle layer; PointPillars [29], which, unlike VoxelNet, divides the point cloud into different vertical cylinders and projects the point cloud in each cylinder onto a 2D feature map; PointRCNN [30], which uses a deformation-based convolution for feature extraction, retaining local structure information; GLENet [31], which constructs probabilistic detectors by generating uncertainty labels and proposes an uncertainty-aware quality estimator architecture to guide the training of IoU branches with predictive localization uncertainty.
However, these advanced 3D detection models have high complexity and cannot be deployed in real-time applications such as autonomous driving, so exploring efficient 3D object detection models has become a research hotspot. SECOND [32] uses 3D sparse convolution instead of standard 3D convolution. A structured knowledge distillation framework is proposed in PointDistiller [33] to obtain a lightweight 3D point cloud object detection model. Ref. [34] proposes simplified KNN search and graph shuffling to improve the efficiency of convolution. Ref. [35] combines the features of a point-based branch and a voxel-based branch, which not only performs effective feature extraction but also reduces memory occupancy. Ref. [36] uses sparse point-voxel convolution instead of point-voxel convolution to reduce model complexity. For indoor 3D object detection, Ref. [37] proposes a generative sparse detection network whose key component is a generative sparse tensor decoder that uses a series of transposed convolution and pruning layers to expand the support of sparse tensors while discarding unlikely object centers, keeping runtime and memory footprint minimal. Ref. [38] proposes an anchor-free method that achieves 3D object detection in a purely data-driven way and introduces a novel oriented bounding box parameterization that reduces the number of hyperparameters.

Method
This section provides a detailed description of the lightweight 3D point cloud object detection algorithm proposed in this paper. First, the LW-Sconv module is introduced, which is the basic block for constructing the network structure. Then, the knowledge distillation loss used to improve the performance of the algorithm is described.

Lightweight 3D Sparse Convolution Layer Module
The core of the LW-Sconv module consists of a group convolution, a transpose-and-reshape operation, and a depth-wise sparse convolution. The filter of the group convolution is 1 × 1 × 1 and the filter of the depth-wise sparse convolution is 3 × 3 × 3. Each convolution operation is followed by batch normalization and ReLU. Because 3D sparse convolution has two types (sub-manifold sparse convolution and regular sparse convolution), we describe the two corresponding types of the LW-Sconv module in Figure 1.
There are three adjustable hyperparameters in the LW-Sconv module: g, C, and C′. g represents the number of groups in the 1 × 1 × 1 group convolution layer; C represents the number of filters in the group convolution layer; C′ represents the number of filters in the depth-wise sparse convolution layer. This paper sets C′ < C, which limits the number of input channels of the depth-wise sparse convolution.
The overall goal of the lightweight model design in this paper is to change the calculation method of standard 3D sparse convolution and to identify a convolutional neural network architecture with lower complexity. Therefore, this paper designs a lightweight 3D sparse convolution layer module on the following basis:
• Using 1 × 1 × 1 group convolution to process the input data, not only are the parameters of the filter reduced compared with an ordinary 3 × 3 × 3 filter, but the grouped 1 × 1 × 1 convolution is more suitable for lightweight networks with constrained complexity than a dense 1 × 1 × 1 convolution;
• Applying transpose and reshape operations, the feature map is concatenated from the outputs of the different group convolutions of the previous layer. These operations avoid the blocking of information flow caused by grouping, which benefits the subsequent extraction of global features;
• Applying 3 × 3 × 3 depth-wise sparse convolution: consider a standard 3D convolution, where the number of parameters in the layer is (number of input channels) × (number of filters) × (3 × 3 × 3). With depth-wise sparse convolution, the feature map is convolved channel by channel, which keeps the number of parameters small. Compared with a 1 × 1 × 1 filter, the receptive field is larger, more information is read, and better global features are obtained.
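The transpose-and-reshape step in the second point is the channel-shuffle idea from ShuffleNet. A minimal NumPy sketch on a sparse feature matrix (the shapes and values are illustrative):

```python
import numpy as np

def channel_shuffle(features, groups):
    """Mix channels across groups via reshape -> transpose -> reshape.

    features: (N, C) sparse feature matrix (N non-empty voxels, C channels);
    C must be divisible by `groups`.
    """
    n, c = features.shape
    assert c % groups == 0
    # (N, g, C/g) -> swap the group and per-group channel axes -> (N, C)
    return features.reshape(n, groups, c // groups).transpose(0, 2, 1).reshape(n, c)

x = np.arange(12).reshape(2, 6)  # 2 voxels, 6 channels, e.g. 2 groups
print(channel_shuffle(x, 2))
# per voxel, channels [0,1,2,3,4,5] become [0,3,1,4,2,5]
```

After the shuffle, each group of the next grouped convolution receives channels originating from every group of the previous layer, so information circulates despite the grouping.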
The above lightweight sparse convolution layer module design applies the ideas of group convolution and depth-wise convolution to sparse convolution, which can be implemented by highly optimized general matrix multiplication (GEMM). Sparse convolution is similar to the standard convolution calculation process in that it uses GEMM to accelerate matrix operations. Therefore, the above lightweight method is also applicable and effective for sparse convolution. The difference is that standard convolution uses im2col to gather and scatter, while sparse convolution constructs the matrix and restores the spatial positions based on a rulebook and hash table built in advance. Because only the non-empty features in the feature map are stored when constructing the rulebook, a feature map of size H × W × D × C is converted to N × C, where N represents the number of non-empty positions and C represents the feature dimension. Using the FLOPs formula in [17], the FLOPs of a standard sparse convolution layer are 3 × 3 × 3 × N × C × C′, whereas the FLOPs of the LW-Sconv module are N × C × C′ + 3 × 3 × 3 × N × C′. It can be seen that the FLOPs are greatly reduced by the lightweight design.
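The FLOPs estimate above can be sketched numerically (with N non-empty positions, C input channels, and C′ depth-wise filters; the group count g and the example sizes below are illustrative assumptions, not values from the paper):

```python
def flops_standard_sparse(n, c, c_prime):
    # 3 x 3 x 3 sparse convolution evaluated at the N non-empty positions
    return 27 * n * c * c_prime

def flops_lw_sconv(n, c, c_prime, g=1):
    pointwise = n * c * c_prime // g   # 1 x 1 x 1 (group) convolution
    depthwise = 27 * n * c_prime       # channel-wise 3 x 3 x 3 convolution
    return pointwise + depthwise

# Example sizes chosen only to show the order-of-magnitude reduction.
n, c, c_prime = 100_000, 64, 64
ratio = flops_standard_sparse(n, c, c_prime) / flops_lw_sconv(n, c, c_prime)
print(f"standard / LW-Sconv FLOPs ratio: {ratio:.1f}")
```

Even without grouping (g = 1), factorizing the 3 × 3 × 3 kernel into a point-wise and a depth-wise step cuts FLOPs by more than an order of magnitude at these sizes; grouping reduces the point-wise term further by a factor of g.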

Lightweight 3D Point Cloud Object Detection Algorithm Framework
The overall framework of the algorithm is described in this section; the detailed network architecture is illustrated in Figure 2. Figure 2a shows the workflow of the algorithm. Firstly, the raw point cloud is voxelized to obtain voxels, which are used as the input of the network. The input voxels are encoded by voxel feature encoding (VFE) [28] to obtain a voxel feature mapping. Then, the voxel features are sent to the 3D backbone. The extracted 3D features are reshaped to a bird's-eye view (BEV) along the z-axis direction and fed into the 2D backbone for further feature extraction. Finally, they are sent to the detect head to obtain the output of the network. The most important part is the architecture of the backbone, shown in Figure 2b. The 3D backbone is composed of 11 LW-Sconv modules and the 2D backbone is composed of 12 lightweight convolution layer modules (LW-conv modules). For the 2D backbone, the design idea of the lightweight convolution layer module is the same as for the 3D one, but the filters are 2D. For the 3D backbone, an LW-Sconv module is not applied at the very beginning because the number of input channels is small. Starting from the second layer, LW-Subsconv modules and LW-Regsconv modules are stacked. The first convolution layer module in the second and third stages uses stride = 2. Other hyperparameters remain unchanged within a stage and the number of output channels doubles in the next stage.


Loss Function
The loss function for supervised training of the model is the knowledge distillation loss, since knowledge distillation is a technique commonly used to improve lightweight models. A teacher-student framework is used, where SECOND serves as the teacher model and the lightweight 3D point cloud object detection algorithm proposed in this paper serves as the student model. The overall training framework is shown in Figure 3 and contains three losses. L_FSP constrains the student model to imitate the teacher model's process of feature map extraction. L_feature constrains the student model to imitate the teacher model's feature map before it is sent to the detect head. L_out constrains the student model to imitate the teacher model's soft labels. The detailed definitions are given below.

The calculation of L_FSP requires first defining the FSP matrix, which describes the flow of feature information through the backbone. The FSP matrix is computed as the dot product of the input and output feature maps of each stage of the backbone. Following [39], the FSP matrix for each stage is calculated as

G_{i,j} = Σ_{p=1}^{h} Σ_{q=1}^{w} Σ_{r=1}^{d} ( F¹_{p,q,r,i} × F²_{p,q,r,j} ) / (h × w × d)

where i and j are channel indices; F¹_{p,q,r,i} is the value of F¹ at coordinate (p, q, r, i); F²_{p,q,r,j} is the value of F² at coordinate (p, q, r, j); F¹ is the input feature map of the sub-module; F² is the output feature map of the sub-module; and (h, w, d) is the size of the feature map. L_FSP is the L2 [40] loss between the FSP matrices of the student model and the teacher model:

L_FSP = (1/T) Σ_x Σ_{i=1}^{n} λ_i × || G_i^T(x) − G_i^S(x) ||²₂

where λ_i represents the weight of each loss term; T is the number of data samples; x is the input data; n is the number of FSP matrices; G_i^T indicates the i-th FSP matrix of x for the teacher model; and G_i^S indicates the i-th FSP matrix of x for the student model.
The feature map sent to the detect head provides rich information for the final prediction. L_feature is the L2 loss between the feature maps of the student model and the teacher model; its purpose is to improve the quality of the student model's feature map and provide better feature information for the final prediction. It is calculated as

L_feature = (1/T) Σ_{t=1}^{T} || u_S(x_t) − u_T(x_t) ||²₂

where t is the index of the data sample; T is the number of data samples; u_S(x_t) indicates the feature map of x_t in the student model; and u_T(x_t) indicates the feature map of x_t in the teacher model. L_out constrains the student model to learn the soft labels of the teacher model, including the bounding box regression of the region proposal network and the classification labels of the region classification network. It is calculated as

L_out = (1/T) Σ_{t=1}^{T} || g(x_t) − z(x_t) ||²₂

where x_t indicates the t-th input data sample; g(x_t) is the soft label of x_t in the student model; and z(x_t) is the soft label of x_t in the teacher model.
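The FSP and feature terms above can be sketched in a few lines of NumPy (the shapes, weights, and random features are illustrative assumptions; the real implementation operates on the detector's actual backbone feature maps):

```python
import numpy as np

def fsp_matrix(f_in, f_out):
    """FSP matrix of one backbone stage: inner products between the stage's
    input and output feature maps, averaged over the h*w*d spatial positions."""
    h, w, d, _ = f_in.shape
    return np.einsum('pqri,pqrj->ij', f_in, f_out) / (h * w * d)

def l2_loss(a, b):
    return float(np.sum((a - b) ** 2))

# Toy stand-ins for one stage's feature maps in the student (s) and teacher (t).
rng = np.random.default_rng(0)
f_in_s, f_out_s = rng.normal(size=(4, 4, 4, 8)), rng.normal(size=(4, 4, 4, 16))
f_in_t, f_out_t = rng.normal(size=(4, 4, 4, 8)), rng.normal(size=(4, 4, 4, 16))

loss_fsp = l2_loss(fsp_matrix(f_in_s, f_out_s), fsp_matrix(f_in_t, f_out_t))
loss_feature = l2_loss(f_out_s, f_out_t)  # feature maps fed to the detect head
```

Note that the FSP matrix is a small (c1 × c2) summary regardless of spatial size, so L_FSP compares how features transform across a stage rather than the features themselves.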

Dataset
This paper conducts experimental verification on two public datasets. The KITTI dataset [41] was jointly created by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago. It contains 7481 training samples and 7518 test samples and uses mean average precision (mAP) to evaluate object detection models. The nuScenes dataset [42] is a large-scale autonomous driving dataset developed by the Motional team for 3D object detection in urban scenes. It contains 1000 driving scenarios and 390,000 LiDAR sweeps. Its evaluation metrics include mAP, NDS, mATE, mASE, mAOE, mAVE, and mAAE; the main ones are mean average precision (mAP) and the nuScenes detection score (NDS). mAP is calculated from the average precision (AP) of the different categories, where the AP metric defines a match by the 2D center distance on the ground plane for a given threshold. NDS is a weighted average of mAP and the other attribute metrics, covering translation, scale, orientation, velocity, and other box attributes.
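For reference, the NDS combines mAP (at half the total weight) with scores derived from the five true-positive error metrics; a sketch following the nuScenes definition (the input values are illustrative):

```python
def nds(m_ap, mate, mase, maoe, mave, maae):
    # nuScenes detection score: mAP gets weight 5; each TP error metric
    # contributes (1 - min(1, error)) with weight 1; normalized by 10.
    tp_scores = [1.0 - min(1.0, e) for e in (mate, mase, maoe, mave, maae)]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

# e.g. perfect TP metrics with mAP = 0.5 give NDS = 0.75
print(nds(0.5, 0.0, 0.0, 0.0, 0.0, 0.0))
```

Because the error metrics are clipped at 1 and inverted, a detector cannot compensate for poor localization, scale, or orientation estimates with high mAP alone.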

Implementation Details
We take SECOND [32] as the baseline; it is one of the most representative voxel-based object detectors, with an end-to-end architecture and good real-time performance. For the training phase, we use the AdamW [43] optimizer with β1 = 0.9 and β2 = 0.999. The initial learning rate is 0.003 and the weight decay is 0.0001. We train the network with a batch size of four on two NVIDIA Quadro RTX 6000 GPUs. The experiments in this paper follow the settings above; other configurations are the same as in OpenPCDet (https://github.com/open-mmlab/OpenPCDet (accessed on 16 March 2020)), since we conduct all experiments with this toolbox. Training comprises three stages. In the first stage, we minimize L_FSP to make the FSP matrices of the student network similar to those of the teacher network. In the second stage, the model is initialized with the weights obtained from the first stage and we use L_feature to train the feature extraction backbone of the student network. In the third stage, the model is initialized with the weights obtained from the second stage and we use L_out to train the detection head. Finally, we fine-tune the entire network using the same configuration as the training phase.

Quantitative Evaluation
Experiment A. In order to verify the feasibility of the lightweight model design in this paper, three different lightweight 3D sparse convolution layer modules are designed for ablation experiments; the experiments are performed on the KITTI dataset.
The first design idea uses 1 × 1 × 1 filters to process the input data, the number of which is S (where S < C); this reduces the dimension of the input data. Then, the output of the previous layer passes through a 1 × 1 × 1 filter and a 3 × 3 × 3 filter in parallel, each with C′/2 filters. Finally, the H × W × D × C′ feature map is obtained by concatenating the outputs of the two group convolutions.
The second design idea uses 3 × 3 × 3 depth-wise sparse 3D convolution to convolve the H × W × D × C input channel by channel, with C filters, and then uses 1 × 1 × 1 point-wise sparse 3D convolution, with S filters, to weight and combine the output of the previous layer along the depth direction.
The third design idea is described in detail in Section 3.1. The experimental results are shown in Table 1. Comparing SECOND with LW-SECOND-1, the computation is reduced by 17.4 G FLOPs, the number of parameters is reduced by 1.72 M, and the mAP drops by 8%. Comparing SECOND with LW-SECOND-2, the computation is reduced by 49.6 G FLOPs, the parameters by 4.41 M, and the mAP by 12.1%. Comparing SECOND with LW-SECOND-3, the computation is reduced by 51.0 G FLOPs, the parameters by 4.67 M, and the mAP by 16.4%. Table 1 reports the FLOPs, parameters, and mAP of the object detection algorithms built from the three lightweight 3D sparse convolution layer modules designed in this paper. All three designs reduce FLOPs and parameters to different degrees, indicating that the lightweight 3D sparse convolution layer module explored in this paper is feasible and that the third design achieves the greatest compression. This shows that group convolution and factorized convolution can effectively compress the model, although they are accompanied by a certain loss of accuracy.

Experiment B. The knowledge distillation loss is used to supervise the training of the lightweight 3D point cloud object detection algorithm. To show that the three-part knowledge distillation loss yields the best performance improvement, ablation experiments are performed on KITTI and nuScenes.
Regarding the experiments on KITTI, we report the average precision calculated at 40 recall positions for BEV object detection and 3D object detection, respectively. The implementation details are the same as in Section 4.2. The experimental results are shown in Table 2, which gives the results of BEV detection and 3D detection on the KITTI dataset. Taking the BEV detection results as an example, comparing SECOND and LW-SECOND-3, the accuracy drops from 63.3% to 48.7%; model compression by lightweight design alone reduces the detection accuracy of the algorithm. Comparing LW-SECOND-3 and LW-SECOND-3′, the detection accuracy improves from 48.7% to 48.8%. There is some effect, but it is not obvious, indicating that constraining the student model with L_out alone is not enough to imitate the teacher model. Comparing LW-SECOND-3′ and LW-SECOND-3′′, the detection accuracy improves from 48.8% to 53.0% when training additionally with the L_feature constraint, indicating that imitating the teacher model's feature map benefits the prediction output. Comparing LW-SECOND-3′′ and LW-SECOND-3′′′, the detection accuracy improves from 53.0% to 62.7%, a significant improvement, indicating that guidance and supervision with all three loss terms best recovers the accuracy of the student model. Table 2. Experimental results on the KITTI dataset. Model indicates which network architecture is selected; LW-S-3 represents LW-SECOND-3 trained without knowledge distillation. The symbol ′ indicates that only L_out is used for knowledge distillation; ′′ indicates that L_out and L_feature are used; ′′′ indicates that L_out, L_feature, and L_FSP are used.
Regarding the experiments on nuScenes, we report the detection results in terms of the evaluation indicators mAP, NDS, mATE, mASE, mAOE, mAVE, and mAAE. The implementation details of the experiment are the same as those in Section 4.2. The experimental results are shown in Table 3, which evaluates the algorithm with the different indicators: larger values of mAP and NDS are better, while smaller values of mATE, mASE, mAOE, mAVE, and mAAE are better. Comparing LW-SECOND-3 with LW-SECOND-3′′′, mAP and NDS increase by 10.3% and 9.4%, respectively. Figure 4 also provides the detailed precision-recall (PR) curves for the two datasets, showing the difference between the lightweight model trained with and without knowledge distillation. The two ablation experiments show that supervising the training of the lightweight 3D point cloud object detection algorithm with the three-part knowledge distillation loss effectively restores the accuracy of the lightweight model, ensuring that the lightweight model can meet the task requirements while reducing the complexity of the model.
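For readers unfamiliar with the nuScenes indicators, NDS aggregates mAP with the five true-positive error metrics listed above. A sketch following the nuScenes devkit definition (half the weight on mAP, half spread over the error metrics):

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score (NDS): 5/10 of the weight on mAP, 1/10 on each
    of the five TP error metrics (mATE, mASE, mAOE, mAVE, mAAE), with each
    error clipped to [0, 1] and converted to a score via 1 - error."""
    assert len(tp_errors) == 5
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0
```

This makes explicit why larger mAP/NDS and smaller error metrics are better: every error term enters NDS only through the 1 - error score.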


Experiment C.
In order to prove the necessity of model compression, the 3D object detection algorithm proposed in this paper is compared with other 3D object detection algorithms. The comparison results are shown in Table 4. Compared with the other algorithms, the parameters and FLOPs of the proposed algorithm are greatly reduced while the detection accuracy remains at an upper-middle level. The proposed method is therefore better suited for deployment on edge devices. Table 4 shows the results with AP calculated at 40 recall positions for the car class on the KITTI test set. F and P indicate the number of floating point operations (/G) and parameters (/M).
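The AP@R40 metric used in Table 4 can be sketched as follows: precision is interpolated at 40 equally spaced recall positions and averaged. This is a simplified illustration of the KITTI-style evaluation, taking an already-computed PR curve as input rather than raw detections.

```python
def ap_r40(recalls, precisions):
    """KITTI-style AP@R40: average the interpolated precision at 40 equally
    spaced recall positions (1/40, 2/40, ..., 1.0). Interpolated precision at
    recall r is the maximum precision achieved at any recall >= r."""
    total = 0.0
    for i in range(1, 41):
        r = i / 40.0
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 40.0
```

For example, a detector that reaches recall 0.5 at precision 1.0 and then finds nothing else scores AP@R40 = 0.5, since the 20 recall positions above 0.5 contribute zero.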

Qualitative Results
In this section, we visualize the 3D object detection results on KITTI and nuScenes to show the influence of training the lightweight 3D point cloud object detection algorithm with and without the knowledge distillation loss. Figure 5 shows some detection results from the KITTI test set, and Figure 6 shows some detection results from the nuScenes test set. The comparison shows that the lightweight 3D point cloud object detection algorithm trained with knowledge distillation is more accurate. The purple circle marks an area where misclassification is reduced. This is mainly due to L_feature, under which the lightweight 3D point cloud object detection algorithm imitates the feature map fed into the detection head of SECOND; learning this feature map provides more information for the output prediction and improves classification accuracy. The red circle marks an area where false positive predictions are reduced. This is mainly because, under L_FSP, the lightweight algorithm imitates the feature extraction process of SECOND and learns how to extract more effective features, providing more useful feature maps for subsequent detection.
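The L_FSP term referenced above compares "flow of solution procedure" (FSP) matrices, which summarize how one feature map is transformed into another. A minimal sketch over plain nested lists, with feature maps given as f[channel][position]; the flattened layout is an assumption for illustration (the real computation runs on spatial feature tensors).

```python
def fsp_matrix(f1, f2):
    """FSP matrix between two feature maps of the same spatial size, given as
    f[channel][position] lists: G[i][j] = mean over positions of f1_i * f2_j."""
    n_pos = len(f1[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n_pos for cj in f2]
            for ci in f1]

def fsp_loss(student_pair, teacher_pair):
    """L_FSP sketch: mean squared difference between the student's and the
    teacher's FSP matrices for a corresponding pair of layers."""
    gs = fsp_matrix(*student_pair)
    gt = fsp_matrix(*teacher_pair)
    flat_s = [v for row in gs for v in row]
    flat_t = [v for row in gt for v in row]
    return sum((a - b) ** 2 for a, b in zip(flat_s, flat_t)) / len(flat_s)
```

Because the FSP matrix captures the relation between two layers rather than a single activation, matching it pushes the student to imitate how the teacher extracts features, which is consistent with the reduced false positives observed in the red-circled regions.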

Conclusions
In order to reduce model complexity so that the model meets the needs of our edge devices, this paper proposes a lightweight 3D point cloud object detection algorithm. First, a novel 3D sparse convolution layer module is designed using factorized convolution and group convolution and is used as the building block of a lightweight convolution network. The module is composed of point-wise 3D convolution and depth-wise 3D sparse convolution, and transpose and reshape operations are introduced to process the feature map and help information flow between channels, reducing the complexity of the model and accelerating it. In addition, inspired by knowledge distillation, this paper uses the teacher-student mode to complete training. Extensive experimental verification is carried out on two public datasets, and the effectiveness of the approach in improving the detection accuracy of lightweight models is proven through quantitative evaluation and qualitative results. Finally, the proposed algorithm is compared with the baseline, showing that model complexity is greatly reduced while detection accuracy is not significantly degraded. Compared with other 3D point cloud object detection algorithms, the algorithm in this paper achieves a better balance between complexity and detection accuracy, indicating that it is more suitable for deployment on edge devices. The lightweight 3D point cloud object detection algorithm explored in this paper has both theoretical and practical significance. In the future, it can be extended to other 3D tasks to reduce model complexity and achieve model acceleration.
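The transpose and reshape operations mentioned above correspond to a channel shuffle, which lets information cross group boundaries between consecutive group convolutions. A minimal sketch over a 1-D list of channel indices (the real operation acts on sparse 3D feature tensors, but the channel permutation is the same):

```python
def channel_shuffle(channels, groups):
    """Reshape the channel list to (groups, channels_per_group), transpose,
    and flatten, so that a subsequent group convolution sees channels drawn
    from every group of the previous layer."""
    n = len(channels)
    assert n % groups == 0
    per_group = n // groups
    grid = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    return [grid[g][i] for i in range(per_group) for g in range(groups)]
```

For 6 channels in 2 groups, the shuffle maps [0, 1, 2, 3, 4, 5] to [0, 3, 1, 4, 2, 5]: each output group now mixes channels from both input groups, which is what restores cross-channel information flow after the grouped layers.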