1 Introduction

3D object detection is one of the key technologies in the field of autonomous driving and has received wide attention. Recently, the points scanned by the LiDAR sensor have become the main input for 3D object detection [1]. However, in real scenes, the distribution of the point cloud is sparse, irregular, and unbalanced, and thus LiDAR-based 3D object detection still faces great challenges.

The current advanced 3D object detection methods can be classified into view-based [2,3,4], point-based [5,6,7,8], and voxel-based [9,10,11,12] approaches. The view-based methods project point clouds onto 2D views and then apply well-developed 2D Convolutional Neural Networks (CNNs) on the different views for 3D detection. However, these projection operations compress and discard 3D spatial information, which limits the detection performance on 3D objects. Point-based methods were subsequently proposed to preserve the spatial information of the original point cloud. They extract point-wise features directly from the original point cloud to generate accurate detection boxes. However, these methods spend substantial computation on reconstructing neighborhood points, leading to low inference speed. To reduce the computation and accelerate inference, voxel-based methods convert the point cloud into regular voxels and then extract voxel features with 3D sparse convolution. Such operations preserve 3D spatial information while reducing computational complexity.

However, in real scenes, some weak objects are scanned with too few points to depict their complete boundaries and thus cannot provide sufficient spatial features. In this case, the boundaries receive too little attention, making it difficult for the above methods to detect these weak objects and degrading the overall detection performance.

To address the above issue, we introduce the attention mechanism to compute the spatial contextual correlation among different parts of objects, so as to guide the network to focus on the parts that lack points to depict the boundary and thereby determine object boundaries more clearly. That is, we establish an attention-based 3D object detection network, CIANet, which effectively improves the overall detection performance, especially for weak objects with small sizes such as pedestrians and cyclists.

Specifically, the first stage of the efficient voxel-based network Voxel-RCNN [12] compresses the 3D spatial features into a BEV representation to generate initial proposals and thus ignores the global spatial contextual association. We therefore first construct the Channel-Spatial Hybrid Attention (CSHA) module to model channel interdependencies and explore important spatial information of the BEV features, followed by a 2D CNN to further extract the enhanced features. We then construct the Contextual Self-Attention (CSA) module to extract the global spatial contextual cues among different parts of objects, which are supplied to the enhanced BEV features. In this way, the sparse boundaries of objects with too few points are enhanced to generate high-quality proposals. In the second stage of the network, we first employ the voxel RoI pooling operation to capture the RoI feature of the object, and then construct the Dimensional-Interaction Attention (DIA) module to extract the internal association among different dimensions of the feature, thus highlighting the RoI feature for refining the proposal. Owing to these well-designed attention-based modules, CIANet can focus on the boundary information of weak objects in real scenes and localize them accurately, thus improving the overall detection performance.

The main contributions of this work are as follows:

  • We present the Channel-Spatial Hybrid Attention (CSHA) module and Contextual Self-Attention (CSA) module in the first stage, which highlight vital channel-spatial features and aggregate rich global contextual information of objects to generate more accurate proposals.

  • We design a Dimensional-Interaction Attention (DIA) module in the second stage. It integrates the interactions between the channel dimension and the spatial dimensions of objects to enhance RoI features, thus refining the proposals to generate the final accurate detection boxes.

  • Our proposed novel CIANet achieves better 3D object detection performance than other advanced methods on the KITTI and Waymo benchmarks, especially for weak objects with small sizes like pedestrians and cyclists. Experimental results show that our proposed attention modules can enrich the feature representation of the object and improve detection performance.

2 Related Work

2.1 3D Object Detection with LiDAR Point Clouds

Existing 3D object detection methods based on LiDAR point clouds can be divided into three categories: view-based, point-based, and voxel-based methods. The view-based methods [2,3,4] convert the point cloud into 2D views, which are then fed into 2D CNNs for detection. In particular, MV3D [2] converts the point cloud into a bird's eye view and a front view and then integrates the RGB image as input data for object detection. However, MV3D is difficult to apply widely in real scenarios. To further improve detection speed, PIXOR [3] implements quick and efficient object detection based on the bird's eye view of the scene. Furthermore, considering that the point cloud has a highly variable point density in the bird's eye view, MVF [4] fuses the perspective view, which provides dense observations, into the bird's eye view to exploit complementary information and generate accurate detection boxes. Overall, these view-based methods project the point cloud into 2D views, which results in the loss of point cloud depth information and thus affects detection performance.

Point-based methods [5,6,7,8] employ PointNet [13] or PointNet++ [14] to extract features from the original point cloud, avoiding the information loss caused by projection operations. The typical two-stage PointRCNN [5] employs PointNet++ to extract features and segment the foreground points to generate high-quality 3D proposals, and then combines semantic features and local spatial features to refine the proposals. Subsequently, to reduce the excessive computational complexity of two-stage networks, the single-stage detector 3DSSD [6] removes the time-consuming upsampling layer and refinement stage and proposes a new fusion sampling strategy to improve detection efficiency. However, these point-based methods require considerable computation to search neighboring points during feature extraction, which results in slow inference speed.

Voxel-based methods [9,10,11,12] convert the point cloud into regular voxels and employ well-developed sparse CNNs to extract features, thus improving the computation speed of the network. The typical method SECOND [9] adopts 3D sparse convolution to extract voxel features, followed by a region proposal network to generate the detection boxes, which maintains fast detection speed. In addition, PointPillars [10] voxelizes point clouds into pillars, which are further encoded into 2D pseudo-images, followed by a 2D CNN to extract features, thus further accelerating detection. However, SECOND and PointPillars lose spatial geometry information due to voxelization, which affects detection accuracy. Some methods [15,16,17] address this problem by integrating voxel features into key points, which leads to a rich feature representation with 3D structure information but increases the computation. Considering that precise localization of the original points is not necessary, Voxel-RCNN [12] designs a voxel RoI pooling operation to aggregate 3D structural context from 3D voxel features, thus improving detection efficiency while preserving 3D spatial knowledge. Moreover, to further improve detection accuracy, CT3D [18] refines the proposals with a channel-wise Transformer architecture consisting of a proposal-to-point embedding operation, a self-attention-based encoder, and a channel-wise decoder, providing rich spatial semantic features to the detection head for generating detection boxes. These voxel-based methods have higher computational efficiency and adopt detailed refinement schemes to improve the detection performance of the network.

However, in real scenes, some weak objects are scanned with too few points to depict their complete boundaries and thus cannot provide sufficient spatial features. In this case, the boundaries receive too little attention, which makes it difficult for the above methods to detect these weak objects and decreases the overall detection accuracy. Therefore, we apply attention mechanisms on top of the voxel-based backbone network to enhance the critical boundary information of objects, thus further improving the overall detection performance of the network.

2.2 Attention in Computer Vision

In recent years, the attention mechanism has been successfully applied in computer vision, as it can focus on key information and suppress irrelevant interference [19]. Traditional channel and spatial attention mechanisms [20,21,22] build attention masks to select the vital channels and spatial regions that need to be attended to, thus enhancing the representation of key object features. These traditional attention mechanisms only focus on local regions; hence, the self-attention mechanism was proposed to acquire global contextual information of objects. Furthermore, the Transformer structure based on self-attention [23] was proposed to capture long-range contextual information and highlight critical feature information of objects.

Due to the complexity of the scene in the 3D object detection task, it is essential to focus on the critical features of objects, so attention mechanisms are also introduced in 3D object detection. TANet [24] employs channel attention, point attention, and voxel attention strategies to extract the key information of objects in the scene, achieving convincing detection performance. SA-Det3D [25] establishes two self-attention modules to model the contextual information of the 3D object, which achieves superior object detection performance. Pointformer [26] explores and integrates local and global context-aware information to obtain dependencies between multi-scale representations. VoTr [27] adopts the Transformer to construct a backbone network for aggregating the information of the empty and non-empty voxels, thus expanding the non-empty voxel space. Then it utilizes the self-attention mechanism of the Transformer to capture the global context information among the expanded non-empty voxels, thus achieving the attention-weighted features with enhanced context in a larger receptive field.

The above attention mechanisms can effectively improve object detection performance. Therefore, in this work, we apply the attention mechanism to the baseline detector to enhance the critical boundary information of weak objects, thus locating them accurately and improving the overall detection performance.

3 CIANet for 3D Object Detection

In this section, we present the detailed design of the two-stage voxel-based detection network CIANet, where the first stage contains a 3D backbone network and a proposal generation network, and the second stage contains a proposal refinement network. Figure 1 illustrates an overview of CIANet. In the first stage, the 3D backbone network extracts voxel features from the point cloud. Then, in the proposal generation network, we convert the voxel features into 2D BEV features and construct the CSHA module to enhance them, followed by the CSA module to further supplement the 2D BEV features with spatial contextual information, thus generating the proposals. In the second stage, we first apply the voxel RoI pooling operation to extract the RoI features within the proposals, and then design the DIA module to further highlight the RoI features for proposal refinement, thus producing accurate detection boxes. We describe the specific attention modules in the following sections.

Fig. 1

An overview of CIANet. The whole network consists of two stages: the first stage extracts voxel features with a 3D backbone network, and then applies a proposal generation network composed of a Channel-Spatial Hybrid Attention (CSHA) module, a 2D CNN and a Contextual Self-Attention (CSA) module to generate high-quality proposals. In the second stage, the proposals are refined via voxel RoI pooling and the Dimensional-Interaction Attention (DIA) module to generate accurate 3D detection boxes

3.1 Channel-Spatial Hybrid Attention Module

In the 3D backbone network of the first stage, we divide the original point cloud into m regular voxels and then employ 3D sparse convolution to extract high-level voxel features with rich semantic information, which are fed into the proposal generation network and converted into a 2D BEV feature for proposal generation. However, the dependencies between channels are ignored by the 3D sparse convolution operation. Meanwhile, 3D spatial information is compressed during the conversion process, so the key cues of the 2D spatial regions are not distinct. Hence, we construct a novel Channel-Spatial Hybrid Attention (CSHA) module, which combines channel and spatial attention mechanisms to explore interdependencies among the channels of the 2D BEV feature and highlight vital spatial information. The architecture of the CSHA module is shown in Fig. 2.

Fig. 2

Illustration of the architecture of the CSHA module, which consists of a channel-domain branch and a spatial-domain branch. It combines channel and spatial attention mechanisms to explore interdependencies among channels and highlight vital spatial information of the BEV feature

Specifically, we send the 2D BEV feature \(F \in {\mathbb{R}}^{C \times H \times W}\) to the Channel-Spatial Hybrid Attention (CSHA) module, which contains a channel-domain branch and a spatial-domain branch. In the channel-domain branch, we first apply a Global Average Pooling (GAP) operation with a size of H × 1 to condense the spatial dimension H, yielding the feature \(F_{1} \in {\mathbb{R}}^{C \times 1 \times W}\), and then apply a max pooling operation with a size of 1 × W along the spatial dimension W to generate the feature \(F_{2} \in {\mathbb{R}}^{C \times H \times 1}\). We send \(F_{1}\) and \(F_{2}\) to a Multi-Layer Perceptron (MLP) consisting of a dimension-reducing layer and a dimension-raising layer to generate the pooled feature \(F_{g} \in {\mathbb{R}}^{C \times 1 \times 1}\). A sigmoid activation function is then applied to obtain the channel attention map, which is integrated with the feature \(F\) through element-wise multiplication to generate the channel-reweighted feature \(F_{c} \in {\mathbb{R}}^{C \times H \times W}\), as shown in Eq. (1).

$$ F_{c} = \sigma \left( MLP\left( GAP\left( F \right) \right) + MLP\left( MaxPool\left( F \right) \right) \right) \odot F = \sigma \left( W_{2} \left( W_{1} \left( F_{1} \right) \right) + W_{2} \left( W_{1} \left( F_{2} \right) \right) \right) \odot F $$
(1)

where \(W_{1} \in {\mathbb{R}}^{(C/r) \times C}\) and \(W_{2} \in {\mathbb{R}}^{C \times (C/r)}\) are the MLP weights, and \(r\) is the reduction ratio.

In the spatial-domain branch, we first process the feature \(F \in {\mathbb{R}}^{C \times H \times W}\) with max pooling and average pooling of size C × 1 along the channel dimension to obtain two feature maps \(F_{m} \in {\mathbb{R}}^{1 \times H \times W}\) and \(F_{a} \in {\mathbb{R}}^{1 \times H \times W}\). They are concatenated along the channel dimension, followed by a convolution layer and a sigmoid activation function to obtain the spatial attention map \(F_{p} \in {\mathbb{R}}^{1 \times H \times W}\). We then compute the product of the spatial attention map \(F_{p}\) and \(F\) to obtain the region-reweighted feature \(F_{s} \in {\mathbb{R}}^{C \times H \times W}\), as shown in Eq. (2).

$$ F_{s} = \sigma \left( {Conv\left( {\left[ {AvgPool\left( F \right);MaxPool\left( F \right)} \right]} \right)} \right) \odot F $$
(2)

Finally, we employ the element-wise multiplication to integrate the channel-reweighted feature \(F_{c}\) and region-reweighted feature \(F_{s}\), thus achieving the output feature \(F_{out}\) with explicit channel dependencies and enhanced spatial cues, as shown in Eq. (3).

$$ F_{out} = F_{c} \odot F_{s} $$
(3)

The feature \(F_{out}\) is then further processed by a 2D CNN following [9, 12]. In this way, the vital channel and spatial information of the 2D BEV feature is strengthened by the CSHA module.
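To make the two branches concrete, the following is a minimal PyTorch sketch of the CSHA module (Eqs. (1)-(3)). It assumes a shared two-layer MLP, a 7 × 7 convolution in the spatial-domain branch, and that the remaining spatial axis of \(F_{1}\) and \(F_{2}\) is collapsed by averaging before the MLP; these are illustrative choices rather than confirmed implementation details.

```python
import torch
import torch.nn as nn

class CSHA(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel-domain branch: dimension-reducing + dimension-raising MLP (W1, W2 in Eq. 1).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial-domain branch: convolution over the two concatenated pooled maps (Eq. 2).
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # kernel size assumed

    def forward(self, f: torch.Tensor) -> torch.Tensor:   # f: (B, C, H, W) BEV feature
        b, c, _, _ = f.shape
        # F1: average-pool over H -> (B, C, 1, W); F2: max-pool over W -> (B, C, H, 1).
        f1 = f.mean(dim=2, keepdim=True)
        f2 = f.max(dim=3, keepdim=True).values
        # Collapse the remaining spatial axis (assumed: by averaging) and apply the shared MLP.
        g1 = self.mlp(f1.mean(dim=3).squeeze(2))           # (B, C)
        g2 = self.mlp(f2.mean(dim=2).squeeze(2))           # (B, C)
        channel_map = torch.sigmoid(g1 + g2).view(b, c, 1, 1)
        f_c = channel_map * f                              # channel-reweighted feature F_c

        # Spatial-domain branch: pool over channels, concatenate, convolve, sigmoid (Eq. 2).
        f_a = f.mean(dim=1, keepdim=True)                  # (B, 1, H, W)
        f_m = f.max(dim=1, keepdim=True).values            # (B, 1, H, W)
        spatial_map = torch.sigmoid(self.spatial_conv(torch.cat([f_a, f_m], dim=1)))
        f_s = spatial_map * f                              # region-reweighted feature F_s

        return f_c * f_s                                   # Eq. 3

# Example: enhance a 256-channel BEV feature map.
bev = torch.randn(2, 256, 200, 176)
out = CSHA(256)(bev)                                       # same shape as the input
```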

3.2 Contextual Self-Attention Module

In the proposal generation network, we employ the CSHA module to obtain the enhanced 2D BEV feature. However, it lacks global spatial contextual associations. Considering that high-level voxel features in the 3D backbone network contain uncompressed 3D spatial information, we further construct a CSA module, which captures the spatial context among different parts of objects that are embedded in the high-level voxel features, and then supplies it to the enhanced BEV feature, thus enhancing the insufficient boundary of the weak object with sparse points. The architecture of the CSA module is shown in Fig. 3.

Fig. 3

The architecture of the CSA module. The voxelized original point cloud is sampled by the farthest point sampling operation, followed by the bias estimation strategy and self-attention operation with relative position encoding strategy to capture the spatial context embedded in high-level voxel features

Specifically, for the m voxels divided from the original point cloud in the 3D backbone, we calculate their center points based on the index number of each voxel, thus obtaining a set of center points denoted as \(Y_{center} \in {\mathbb{R}}^{m \times 3}\). We then utilize the Farthest Point Sampling (FPS) operation to select \(n\) sampled points from the center points \(Y_{center} \in {\mathbb{R}}^{m \times 3}\). Each sampled point \(\overline{p}\) has the point feature \(U_{{\overline{p}}} \in {\mathbb{R}}^{C}\) and the 3D position \(L_{{\overline{p}}} = (x^{{(\overline{p})}} ,y^{{(\overline{p})}} ,z^{{(\overline{p})}} ) \in {\mathbb{R}}^{3}\), and we propose a bias estimation strategy to calculate the position bias from its neighboring points to update its position. In this way, the neighboring contextual information is aggregated onto the point \(\overline{p}\).
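For reference, a minimal NumPy sketch of the Farthest Point Sampling step used to pick the n sampled points from the voxel centers is given below; the surrounding feature handling in CIANet is not reproduced here.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """points: (m, 3) voxel-center coordinates; returns the indices of n_samples points."""
    m = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(m, np.inf)                  # distance to the currently selected set
    selected[0] = 0                            # start from an arbitrary point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        # Keep, for every point, the squared distance to its nearest selected point.
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = int(np.argmax(dist))     # pick the farthest remaining point
    return selected

centers = np.random.rand(16000, 3) * 70.0      # e.g. m voxel centers
idx = farthest_point_sampling(centers, 2048)   # n = 2048 sampled points
```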

In detail, we first calculate the feature variation \(\Delta U^{q} \in {\mathbb{R}}^{C}\) and the position variation \(\Delta L^{q} \in {\mathbb{R}}^{3}\) between point \(\overline{p}\) and the neighboring point q, as illustrated in Eq. (4).

$$ \begin{cases} \Delta U^{q} = U_{\overline{p}} - U_{q} , & q \in \varphi (\overline{p}) \\ \Delta L^{q} = L_{\overline{p}} - L_{q} , & q \in \varphi (\overline{p}) \end{cases} $$
(4)

where \(\varphi (\overline{p})\) indicates the set of d neighbor points for point \(\overline{p}\), \(U_{q}\) is the point feature of point q, and \(L_{q}\) is the 3D position of point q.

Then we compute a weighted combination of the feature and position variations, thus achieving the final position bias \(\Delta L_{{\overline{p}}}\) for the point \(\overline{p}\), as denoted in Eq. (5).

$$ \Delta L_{{\overline{p}}} = \frac{{\sum\nolimits_{{q \in \varphi \left( {\overline{p}} \right)}} {MLP\left( {\Delta U^{q} } \right) \cdot \left( {\Delta L^{q} } \right)} }}{{\sum\nolimits_{{q \in \varphi \left( {\overline{p}} \right)}} {MLP\left( {\Delta U^{q} } \right)} }} $$
(5)

After that, we add the final position bias to the initial position of the sampled point \(\overline{p}\) to obtain the renewed position \(L^{\prime}_{p} \in {\mathbb{R}}^{3}\) of the updated point p, as shown in Eq. (6). In this way, the updated points better cover the common feature structures of objects in 3D space.

$$ L^{\prime}_{p} = L_{{\overline{p}}} + \Delta L_{{\overline{p}}} $$
(6)

Next, we aggregate the features of the neighboring points to achieve the renewed feature for the updated point, which is calculated as Eq. (7).

$$ U^{\prime}_{p} = \sum\limits_{{q \in \varphi \left( {\overline{p}} \right)}} {\omega U_{q} } $$
(7)

In this way, the features of all n updated points are adaptively aggregated to form the feature set \(U^{\prime} = \{ u^{\prime}_{1} ,u^{\prime}_{2} , \cdots ,u^{\prime}_{n} \in {\mathbb{R}}^{C} \}\).
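A compact PyTorch sketch of the bias estimation and feature aggregation steps (Eqs. (4)-(7)) is given below. It assumes that the MLP maps each feature variation to a single positive weight and that the same normalized weights play the role of \(\omega\) in Eq. (7); both are our reading of the description above rather than confirmed details.

```python
import torch
import torch.nn as nn

class BiasEstimation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Maps each feature variation to a single positive neighbor weight (assumed form).
        self.mlp = nn.Sequential(nn.Linear(channels, channels // 2),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // 2, 1),
                                 nn.Softplus())

    def forward(self, u_p, l_p, u_q, l_q):
        # u_p: (n, C), l_p: (n, 3) sampled points; u_q: (n, d, C), l_q: (n, d, 3) neighbors.
        du = u_p.unsqueeze(1) - u_q                     # feature variation, Eq. 4
        dl = l_p.unsqueeze(1) - l_q                     # position variation, Eq. 4
        w = self.mlp(du)                                # (n, d, 1) raw neighbor weights
        w = w / (w.sum(dim=1, keepdim=True) + 1e-6)     # normalize as in Eq. 5
        delta_l = (w * dl).sum(dim=1)                   # position bias, Eq. 5
        l_new = l_p + delta_l                           # renewed position, Eq. 6
        u_new = (w * u_q).sum(dim=1)                    # aggregated feature, Eq. 7 (omega := w)
        return l_new, u_new

# Example with n = 2048 sampled points, d = 32 neighbors, and C = 64 channels.
n, d, c = 2048, 32, 64
l_new, u_new = BiasEstimation(c)(torch.randn(n, c), torch.rand(n, 3) * 70.0,
                                 torch.randn(n, d, c), torch.rand(n, d, 3) * 70.0)
```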

To further explore the global contextual information, we employ the self-attention mechanism to calculate semantic interactions between the pair-wise aggregated features.

Exploring global contextual information for updated points by the self-attention mechanism is comparable to capturing semantic correlation between feature nodes while passing messages in a graph. Hence, we employ the graph \(G = (\upsilon ,\lambda )\) to describe the collection of aggregated features and their relationship, where \(\upsilon = \{ u^{\prime}_{1} ,u^{\prime}_{2} , \cdots ,u^{\prime}_{n} \in {\mathbb{R}}^{C} \}\) represents a feature node set and \(\lambda = \{ r_{pg} \in {\mathbb{R}}^{H} \}\) represents an edge set. The \(r_{pg}\) denotes the relationship between feature node p and node g, and \(H\) represents the number of attention heads across \(C\) input channels.

Furthermore, considering that the relative position information between nodes contains accurate spatial dependencies, we introduce the relative position encoding strategy into the self-attention mechanism to extract the global contextual information more accurately. In detail, for a single attention head, we multiply the aggregated feature of the updated point p with the projection matrix \(W_{Q} \in {\mathbb{R}}^{n \times n}\) to obtain the query vector \(Q_{p} \in {\mathbb{R}}^{n \times (C/H)}\), and we multiply the semantic feature of the updated point g with the projection matrices \(W_{K} \in {\mathbb{R}}^{n \times n}\) and \(W_{V} \in {\mathbb{R}}^{n \times n}\) to obtain the key vector \(K_{g} \in {\mathbb{R}}^{n \times (C/H)}\) and the value vector \(V_{g} \in {\mathbb{R}}^{n \times (C/H)}\), respectively. We then apply the relative position encoding strategy: it first clips the Euclidean distance between the coordinates of node p and node g to a maximum value b, and then applies a linear layer to encode the clipped distance into relative position information. The detailed calculation of the position encoding is shown in Eq. (8).

$$ a_{pg} = {\text{linear}} \left( {\min \left( {\left\| {L^{\prime}_{p} - L^{\prime}_{g} } \right\|,b} \right)} \right),g \in \psi $$
(8)

where \(a_{pg} \in {\mathbb{R}}^{n \times (C/H)}\) is the encoded relative position feature, \(\left\| \cdot \right\|\) denotes the Euclidean distance, and \(\psi\) denotes the set of updated points other than p.

Subsequently, we embed the encoded relative position feature in \(K_{g}\) and \(V_{g}\), and calculate the contextual correlation term \(r_{pg}\) between the feature node p and the feature node g, as shown in Eq. (9).

$$ r_{pg} = {\text{softmax}} \left( \frac{Q_{p} \left( K_{g} + a_{pg} \right)^{T} }{\sqrt{C/H}} \right) \cdot \left( V_{g} + a_{pg} \right) $$
(9)

For the node p, we calculate the sum of its contextual correlation terms with other nodes to obtain the accrued term \(S_{p}\), as shown in Eq. (10).

$$ S_{p} = \sum\limits_{g \in \psi } {r_{pg} } $$
(10)

Then we concatenate the accrued terms \(S_{p}\) across attention heads, followed by a linear layer, group normalization, and a residual connection to generate the global contextual feature of node p. Next, we propagate the global contextual features of the n nodes to the m nodes by interpolation, thus obtaining the context features of all m nodes. Each node feature then represents the feature of the voxel in which the node lies.
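The following single-head PyTorch sketch illustrates the contextual self-attention with relative position encoding (Eqs. (8)-(10)); the projection shapes, multi-head concatenation, group normalization, and residual connection are simplified, so it should be read as an illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ContextSelfAttention(nn.Module):
    def __init__(self, channels: int, max_dist: float = 16.0):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        self.pos = nn.Linear(1, channels)          # encodes clipped pairwise distances, Eq. 8
        self.max_dist = max_dist                   # the clipping value b
        self.scale = channels ** -0.5

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats: (n, C) aggregated node features; coords: (n, 3) renewed positions.
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        dist = torch.cdist(coords, coords).clamp(max=self.max_dist)   # clipped distances
        a = self.pos(dist.unsqueeze(-1))           # (n, n, C) relative position feature a_pg
        # Attention logits with position-augmented keys (Eq. 9).
        logits = (q.unsqueeze(1) * (k.unsqueeze(0) + a)).sum(-1) * self.scale
        attn = torch.softmax(logits, dim=-1)       # (n, n)
        # Position-augmented values, accumulated over the other nodes (Eq. 10).
        out = torch.einsum('pg,pgc->pc', attn, v.unsqueeze(0).expand_as(a) + a)
        return out                                 # (n, C) global contextual features

# Toy example with 256 nodes and 64-dimensional features.
ctx = ContextSelfAttention(64)(torch.randn(256, 64), torch.rand(256, 3) * 70.0)
```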

At last, we merge the context features with the enhanced BEV features to achieve the merged features with global contextual association information, which are sent to a Region Proposal Network [28] for generating the proposals.

3.3 Dimensional-Interaction Attention Module

To refine the proposals, we construct a refinement network in stage 2, which consists of a voxel RoI pooling operation and a Dimensional-Interaction Attention (DIA) module. Specifically, we first divide the proposal into different grids and utilize the voxel RoI pooling operation to extract RoI features. Then, to further highlight the vital grid features that contribute to detection, we intend to employ the attention mechanism for feature enhancement, thus strengthening the RoI features for proposal refinement.

The traditional channel attention mechanism compresses the spatial dimensions to capture dependencies among channels, which loses spatial information and thus ignores the internal correlation between the spatial and channel dimensions. Although some approaches utilize spatial attention as a supplementary module to channel attention, they calculate channel attention and spatial attention separately, so the internal correlation between the two attention dimensions is still not captured. Thus, we construct a DIA module to capture the internal correlation among different dimensions of the feature, thereby enhancing RoI features for proposal refinement. The architecture of the DIA module is shown in Fig. 4.

Fig. 4

The architecture of the DIA module. It consists of four branches, which include swap, pooling, convolution, and activation operations. The first three branches capture the internal correlation between the channel and spatial dimensions of RoI features, and the last branch explores the dependencies among spatial dimensions. Finally, we calculate the mean of the outputs of four branches to obtain augmented RoI features for proposal refinement

Specifically, we extract the RoI feature \(T \in {\mathbb{R}}^{C \times L \times W \times H}\) within the proposals by employing the voxel RoI pooling operation [12]. It divides each proposal into \(L \times W \times H\) grids. For each layer of the 3D sparse convolution, the neighboring voxels of the center point in each grid are determined based on the Manhattan distance, and then a PointNet [13] is utilized to aggregate the neighboring voxel features onto the center of the grid to obtain the aggregated features of that layer. Finally, the aggregated features of the last three layers are concatenated to obtain the RoI feature \(T \in {\mathbb{R}}^{C \times L \times W \times H}\).
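As an illustration of this pooling step, the sketch below gathers the non-empty voxels within a Manhattan-distance threshold of each grid center and aggregates them with a max-pooled shared MLP (a PointNet-style reduction); real voxel RoI pooling implementations [12] rely on sparse hashing and custom CUDA kernels, which are omitted here.

```python
import torch
import torch.nn as nn

def manhattan_neighbors(grid_centers: torch.Tensor, voxel_coords: torch.Tensor,
                        threshold: int) -> torch.Tensor:
    # grid_centers: (G, 3) and voxel_coords: (V, 3) integer voxel indices.
    dist = (grid_centers.unsqueeze(1) - voxel_coords.unsqueeze(0)).abs().sum(-1)
    return dist <= threshold                        # (G, V) neighbor mask

class GridAggregator(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_channels, out_channels), nn.ReLU(inplace=True))

    def forward(self, grid_centers, voxel_coords, voxel_feats, threshold: int = 2):
        mask = manhattan_neighbors(grid_centers, voxel_coords, threshold)      # (G, V)
        feats = self.mlp(voxel_feats)                                          # (V, C_out)
        feats = feats.unsqueeze(0).expand(mask.shape[0], -1, -1)               # (G, V, C_out)
        # Max-pool the features of neighboring voxels onto each grid center.
        masked = feats.masked_fill(~mask.unsqueeze(-1), float('-inf'))
        pooled = masked.max(dim=1).values
        return torch.where(torch.isfinite(pooled), pooled, torch.zeros_like(pooled))

# Example: a 6 x 6 x 6 grid of one proposal pooling from 500 non-empty voxels.
grid = torch.randint(0, 40, (216, 3))
vox = torch.randint(0, 40, (500, 3))
pooled = GridAggregator(64, 32)(grid, vox, torch.randn(500, 64))
```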

The RoI feature \(T\) is then fed into the DIA module with four branches. The first branch establishes the interactions between the channel dimension \(C\) and the spatial dimensions \(W\), \(H\). In detail, we first swap the channel dimension \(C\) and the spatial dimension \(L\) of the RoI feature to get the feature \(T_{1} \in {\mathbb{R}}^{L \times C \times W \times H}\), followed by a max pooling and an average pooling operation to compress the first dimension of \(T_{1}\) to two, thus obtaining the pooled feature \(\widehat{{T_{1} }} \in {\mathbb{R}}^{2 \times C \times W \times H}\). After that, we employ a 3D convolution layer and a batch normalization layer on \(\widehat{{T_{1} }}\) to capture the contextual connection between the channel dimension \(C\) and the spatial dimensions \(W\), \(H\), generating the interaction attention feature \(\widehat{{T_{1}^{*} }} \in {\mathbb{R}}^{1 \times C \times W \times H}\). Then we swap the two dimensions back to obtain the feature \(T_{1}^{*} \in {\mathbb{R}}^{C \times 1 \times W \times H}\), followed by a sigmoid activation function to obtain the interaction attention weight. At last, we multiply the weight by the feature \(T\) to get the re-weighted RoI feature \(B_{1}\) with the same shape as \(T\). The above process is described in Eq. (11).

$$ B_{1} = \sigma \left( {P_{{1}}^{*} \left( {\gamma \left( {\chi \left( {Pool\left( {P_{{1}} \left( T \right)} \right)} \right)} \right)} \right)} \right) \cdot T $$
(11)

where \(\sigma\) indicates the sigmoid activation function, \(\chi\) and \(\gamma\) indicate the 3D convolution layer and batch normalization layer, respectively. \(P_{{1}}\) and \(P_{{1}}^{*}\) represent two swap operations.

In the same way, we aim to build interactions between the channel dimension \(C\) and spatial dimensions \(L\), \(H\) in the second branch. We first swap the dimensions of the RoI feature \(T \in {\mathbb{R}}^{C \times L \times W \times H}\) and then compact its spatial dimension \(W\) by the pooling operations to achieve the feature \(\widehat{{T_{{2}} }} \in {\mathbb{R}}^{2 \times L \times C \times H}\). Subsequently, a 3D convolution layer and a batch normalization layer are employed on \(\widehat{{T_{{2}} }}\) to generate the interaction attention feature \(\widehat{{T_{2}^{*} }} \in {\mathbb{R}}^{1 \times L \times C \times H}\). Furthermore, we swap the dimensions of \(\widehat{{T_{2}^{*} }}\) and then adopt the sigmoid activation function to achieve the interaction attention weight, which is multiplied by \(T\) to obtain the re-weighted RoI feature \(B_{2}\). The above process is denoted in Eq. (12).

$$ B_{2} = \sigma \left( {P_{2}^{*} \left( {\gamma \left( {\chi \left( {Pool\left( {P_{2} \left( T \right)} \right)} \right)} \right)} \right)} \right) \cdot T $$
(12)

where \(P_{2}\) and \(P_{{2}}^{*}\) represent two swap operations.

Similarly, in the third branch, we utilize the same approach adopted in the above two branches to build interactions between the channel dimension \(C\) and spatial dimensions \(L\), \(W\), thus obtaining the re-weighted RoI feature \(B_{3}\), as shown in Eq. (13).

$$ B_{3} = \sigma \left( {P_{3}^{*} \left( {\gamma \left( {\chi \left( {Pool\left( {P_{3} \left( T \right)} \right)} \right)} \right)} \right)} \right) \cdot T $$
(13)

where \(P_{3}\) and \(P_{3}^{*}\) represent two swap operations.

In the last branch, we capture the spatial dependencies among spatial dimensions \(L\), \(W\) and \(H\). Firstly, we compress the channel dimension \(C\) of the RoI feature \(T\) through pooling operations to obtain the feature \(T_{{4}} \in {\mathbb{R}}^{2 \times L \times W \times H}\), followed by a 3D convolution layer and a batch normalization layer to get the feature \(T_{{4}}^{*} \in {\mathbb{R}}^{1 \times L \times W \times H}\). And then the \(T_{{4}}^{*}\) is activated by the sigmoid function to get the attention weight, which is multiplied by \(T\) to achieve the re-weighted RoI feature \(B_{4}\), as shown in Eq. (14).

$$ B_{4} = \sigma \left( {\gamma \left( {\chi \left( {Pool\left( T \right)} \right)} \right)} \right) \cdot T $$
(14)

Subsequently, we obtain the final augmented RoI features by calculating the mean of the output features of the four branches, as described in Eq. (15).

$$ D_{out} = \frac{1}{4}\sum\limits_{i = 1}^{4} {B_{i} } $$
(15)

Finally, we send the augmented RoI features into a 2-layer MLP, followed by two branches for confidence prediction and box regression to generate the accurate detection boxes.
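A PyTorch sketch of the DIA module (Eqs. (11)-(15)) is given below. Each branch permutes the 5-D RoI feature so that a different axis plays the channel role, pools that axis with max and average pooling, applies a 3 × 3 × 3 Conv3d with batch normalization, and converts the result into a sigmoid attention weight; the kernel size follows Sect. 4.2, while the remaining layer hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class DIABranch(nn.Module):
    def __init__(self, perm):
        super().__init__()
        self.perm = perm                                       # moves the target axis to dim 1 (P_i)
        self.inv = [perm.index(i) for i in range(len(perm))]   # inverse permutation (P_i^*)
        self.conv = nn.Conv3d(2, 1, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm3d(1)

    def forward(self, t: torch.Tensor) -> torch.Tensor:   # t: (B, C, L, W, H)
        x = t.permute(*self.perm)                          # swap axes
        pooled = torch.cat([x.max(dim=1, keepdim=True).values,
                            x.mean(dim=1, keepdim=True)], dim=1)    # compress dim 1 to two
        weight = self.bn(self.conv(pooled))                # interaction attention feature
        weight = torch.sigmoid(weight.permute(*self.inv))  # swap back and activate
        return weight * t                                  # re-weighted RoI feature B_i

class DIA(nn.Module):
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList([
            DIABranch([0, 2, 1, 3, 4]),   # C interacts with (W, H): pool over L, Eq. 11
            DIABranch([0, 3, 2, 1, 4]),   # C interacts with (L, H): pool over W, Eq. 12
            DIABranch([0, 4, 2, 3, 1]),   # C interacts with (L, W): pool over H, Eq. 13
            DIABranch([0, 1, 2, 3, 4]),   # spatial branch: pool over C, Eq. 14
        ])

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return sum(b(t) for b in self.branches) / 4.0      # Eq. 15

# Example: augment the pooled RoI features of 128 proposals on a 6 x 6 x 6 grid.
roi = torch.randn(128, 96, 6, 6, 6)
augmented = DIA()(roi)
```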

3.4 Training Losses

Our CIANet is a two-stage detection network, which is trained in an end-to-end fashion. The overall loss includes two parts, one is the region proposal loss \({\mathcal{L}}_{RPN}\) in the first stage, and the other is the proposal refinement loss \({\mathcal{L}}_{RCNN}\) in the second stage, as shown in Eq. (16).

$$ {\mathcal{L}}_{ALL} = {\mathcal{L}}_{RPN} + {\mathcal{L}}_{RCNN} $$
(16)

The \({\mathcal{L}}_{RPN}\) includes the loss of classification and box regression, as shown in Eq. (17).

$$ {\mathcal{L}}_{RPN} = \frac{1}{\varepsilon }\left[ \sum\limits_{i} {\mathcal{L}}_{cls} \left( c_{i}^{a} ,l_{i}^{ * } \right) + \vartheta \left( l_{i}^{ * } \ge 1 \right)\sum\limits_{i} {\mathcal{L}}_{reg} \left( r_{i}^{a} ,o_{i}^{ * } \right) \right] $$
(17)

where \(\varepsilon\) is the number of foreground anchors. For the classification loss \({\mathcal{L}}_{cls}\), we apply the Focal Loss function [29] to measure the discrepancy between the classification output \(c_{i}^{a}\) and the classification label \(l_{i}^{ * }\). For the box regression loss \({\mathcal{L}}_{reg}\), we apply \(\vartheta \left( {l_{i}^{ * } \ge 1} \right)\) to select the foreground anchors and compute their loss with the Huber loss between the regression output \(r_{i}^{a}\) and the regression target \(o_{i}^{ * }\), as shown in Eq. (18).

$$ {\mathcal{L}}_{reg} \left( r_{i}^{a} ,o_{i}^{ * } \right) = \begin{cases} \frac{1}{2}\left( r_{i}^{a} - o_{i}^{ * } \right)^{2} , & \left| r_{i}^{a} - o_{i}^{ * } \right| \le \delta \\ \delta \cdot \left( \left| r_{i}^{a} - o_{i}^{ * } \right| - \frac{1}{2}\delta \right), & \text{otherwise} \end{cases} $$
(18)

where \(\delta\) is a hyperparameter calculated from \(r_{i}^{a}\) and \(o_{i}^{ * }\) during the training phase.
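For clarity, Eq. (18) corresponds to the standard Huber loss; a direct implementation is shown below, where \(\delta\) is treated as a fixed hyperparameter for simplicity.

```python
import torch

def huber_loss(pred: torch.Tensor, target: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    diff = (pred - target).abs()
    quadratic = 0.5 * diff ** 2                      # |r - o| <= delta branch of Eq. 18
    linear = delta * (diff - 0.5 * delta)            # otherwise branch of Eq. 18
    return torch.where(diff <= delta, quadratic, linear).mean()

print(huber_loss(torch.tensor([0.2, 3.0]), torch.zeros(2)))   # exercises both branches
```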

The \({\mathcal{L}}_{RCNN}\) includes the box regression loss and the IoU-guided confidence prediction loss of the second stage, as shown in Eq. (19).

$$ {\mathcal{L}}_{RCNN} = \frac{1}{\rho }\left[ {\sum\limits_{i} {{\mathcal{L}}_{cls} \left( {c_{i} ,\partial_{i}^{ * } \left( {IoU_{i} } \right)} \right) + } \vartheta \left( {IoU_{i} \ge \mu_{reg} } \right)\sum\limits_{i} {{\mathcal{L}}_{reg} \left( {r_{i} ,o_{i}^{ * } } \right)} } \right] $$
(19)

where \(\rho\) is the number of sampled 3D proposals during the training phase. \({\mathcal{L}}_{reg}\) is the box regression loss, implemented with the Huber loss. \(IoU_{i}\) represents the IoU score between the ith proposal and its ground truth. We adopt \(\vartheta \left( {IoU_{i} \ge \mu_{reg} } \right)\) to compute the regression loss only for high-quality proposals, which are selected with \(IoU_{i} \ge \mu_{reg}\). In addition, \({\mathcal{L}}_{cls}\) is the IoU-guided classification loss, implemented with the Binary Cross Entropy loss. \(c_{i}\) denotes the classification output and \(\partial_{i}^{ * } \left( {IoU_{i} } \right)\) denotes the classification target, which is calculated as shown in Eq. (20).

$$ \partial_{i}^{ * } \left( IoU_{i} \right) = \begin{cases} 0, & IoU_{i} < \mu_{B} \\ \frac{IoU_{i} - \mu_{B} }{\mu_{F} - \mu_{B} }, & \mu_{B} \le IoU_{i} < \mu_{F} \\ 1, & IoU_{i} \ge \mu_{F} \end{cases} $$
(20)

where \(\mu_{F}\) and \(\mu_{B}\) represent the foreground and background IoU thresholds, respectively.
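The mapping in Eq. (20) can be implemented as a small helper that turns the IoU of each proposal into a soft classification target, as sketched below with the thresholds \(\mu_{F} = 0.75\) and \(\mu_{B} = 0.25\) from Sect. 4.2.

```python
import torch

def iou_guided_target(iou: torch.Tensor, mu_b: float = 0.25, mu_f: float = 0.75) -> torch.Tensor:
    # Linear ramp between the background and foreground thresholds, clipped to [0, 1].
    return ((iou - mu_b) / (mu_f - mu_b)).clamp(min=0.0, max=1.0)

print(iou_guided_target(torch.tensor([0.1, 0.5, 0.9])))   # tensor([0.0000, 0.5000, 1.0000])
```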

4 Experiment

4.1 Datasets and Evaluation Metrics

We first train and test our CIANet on the KITTI dataset, which provides 7481 training LiDAR samples and 7518 test samples. Following the experimental scheme in [9, 12], we further divide the 7481 training samples into a training set of 3712 samples and a validation set of 3769 samples. The objects in the KITTI dataset are classified into three categories: cars, pedestrians, and cyclists. Each category is divided into three difficulty levels (easy, moderate, and hard) according to the degree of occlusion and truncation.

We evaluate the detection performance with the metrics of Average Precision (AP) and mean Average Precision (mAP), where AP is computed by averaging the precision of the predicted detection boxes over recall positions at a given difficulty level, and mAP is obtained by averaging the AP values across the three difficulty levels. We compute AP and mAP on the validation set with 11 recall positions and AP on the test set with 40 recall positions. Moreover, the IoU threshold is set to 0.7 for cars and 0.5 for pedestrians and cyclists.
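As a reference for the metric, the snippet below sketches interpolated Average Precision over 11 recall positions; the full KITTI protocol additionally handles per-class IoU matching, difficulty filtering, and "don't care" regions, which are omitted here.

```python
import numpy as np

def ap_11_point(recall: np.ndarray, precision: np.ndarray) -> float:
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        # Interpolated precision: the maximum precision achieved at recall >= r (0 if none).
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 11.0

# Toy precision-recall curve for a single class and difficulty level.
rec = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
prec = np.array([0.95, 0.9, 0.85, 0.7, 0.5])
print(f"AP = {ap_11_point(rec, prec):.3f}")
```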

Furthermore, we also conduct experiments on the large-scale Waymo dataset [30]. It has 798 training sequences containing 158,361 point cloud samples and 202 validation sequences containing 40,077 samples. The objects in Waymo are split into two difficulty levels, LEVEL_1 and LEVEL_2, where LEVEL_1 objects have at least five LiDAR points and LEVEL_2 objects have at least one LiDAR point. On the Waymo dataset, we use an IoU threshold of 0.7 for vehicles and 0.5 for pedestrians and cyclists. Meanwhile, we adopt the mean Average Precision (mAP) and the mean Average Precision weighted by Heading (mAPH) as evaluation criteria.

4.2 Implementation Details

Network Architecture. For the KITTI dataset, the ranges of the X, Y, and Z axes of the 3D scene are [0, 70.4] m, [-40, 40] m, and [-3, 1] m. The scene is divided into 16,000 voxels for training and 40,000 voxels for testing, and the size of each voxel is set to (0.05 m, 0.05 m, 0.1 m). For the Waymo dataset, the ranges of the X, Y, and Z axes are [-75.2, 75.2] m, [-75.2, 75.2] m, and [-2, 4] m, and the voxel size is set to (0.1 m, 0.1 m, 0.15 m). In the first stage of the network, we first apply the 3D backbone network that stacks four sparse convolution layers with voxel feature dimensions of 16-32-64-64. Then, in the CSHA module, since a smaller reduction ratio r increases the computation of the model while a larger one may decrease accuracy, we set r to 16 to balance complexity and accuracy. We then apply two CSA modules with 4 attention heads to capture the global spatial contextual associations of objects. In each CSA module, we sample 2048 points from the voxel center points, and for each sampled point we select its 32 neighboring points within a radius of 4 m to calculate the biases used to update its position and feature. For the relative position encoding strategy, the distance b is set to 16. In addition, the feature dimension of the self-attention is 64, and the interpolation radius is set to 1.6 m to select 16 downsampled points whose features are propagated back to the original points.

In the second stage, we divide the proposal into 6 × 6 × 6 grids, and then set two Manhattan distance thresholds of 2 and 4 to select neighboring voxels for the grid center point during voxel RoI pooling. Then in the DIA module, the kernel size of the 3D convolution layer is 3 × 3 × 3, and the stride is set as 1.

Training and Inference Details. For the KITTI dataset, our CIANet is trained on a GeForce RTX 3090 Ti GPU in an end-to-end manner for 80 epochs with a batch size of 4. For the Waymo dataset, we train our CIANet on the same device with a batch size of 4 for 30 epochs. We adopt the ADAM optimizer and set the initial learning rate to 0.01, which is updated by the cosine annealing strategy. In the training phase, we adopt Non-Maximum Suppression (NMS) with a threshold of 0.8 to select 512 initial proposals. Then we employ an IoU threshold of 0.55 to sample 128 proposals for classification and regression, where the positive and negative proposals have a ratio of 1:1, and the positive samples have IoU > 0.55 with respect to the ground truth box. For the classification loss, we set the foreground IoU threshold \(\mu_{F}\) to 0.75 and the background IoU threshold \(\mu_{B}\) to 0.25. For the regression loss, we set the box regression IoU threshold \(\mu_{reg}\) to 0.55. In the testing phase, we remove redundant proposals with an NMS threshold of 0.85 to retain the top 100 proposals for box refinement. After that, we further remove redundant detection boxes with an NMS threshold of 0.1 to obtain the final 3D detection boxes.
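The optimization setup described above can be sketched in plain PyTorch as follows; the exact scheduler variant (e.g. a one-cycle schedule with warm-up) used in the actual training code is an assumption we do not confirm, and the tiny linear model merely stands in for the detector.

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 7)                          # stand-in for the CIANet detector
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
epochs, iters_per_epoch = 80, 928                 # e.g. 3712 KITTI train samples / batch size 4
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * iters_per_epoch)    # anneal the learning rate over training

for step in range(5):                             # a few illustrative steps
    optimizer.zero_grad()
    loss = model(torch.randn(4, 64)).pow(2).mean()  # placeholder for L_RPN + L_RCNN
    loss.backward()
    optimizer.step()
    scheduler.step()                              # one scheduler step per iteration
```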

Moreover, to improve the generalization of the model, we adopt several common data augmentation techniques. In detail, we randomly select half of the scenes to flip along the X axis, rotate each scene around the Z axis within an angle range of [-π/4, π/4], and scale it with a random factor between 0.95 and 1.05. We also apply the ground-truth sampling augmentation mechanism, which randomly pastes ground-truth objects from other scenes into the current scene; for each object category, we paste at least 15 ground-truth objects.
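A NumPy sketch of the global augmentations listed above is given below; ground-truth sampling requires a database of cropped objects and is therefore not shown, and the flip is interpreted as mirroring the Y coordinate across the X axis.

```python
import numpy as np

def augment_scene(points: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    pts = points.copy()                             # (N, 3) x, y, z coordinates
    if rng.random() < 0.5:                          # flip roughly half of the scenes
        pts[:, 1] = -pts[:, 1]                      # mirror across the X axis (assumed convention)
    theta = rng.uniform(-np.pi / 4, np.pi / 4)      # global rotation around the Z axis
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])
    pts = pts @ rot.T
    pts *= rng.uniform(0.95, 1.05)                  # global scaling
    return pts

aug = augment_scene(np.random.rand(1000, 3) * 70.0, np.random.default_rng(0))
```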

4.3 3D Detection on the KITTI Dataset

We train the CIANet model on the training set of the KITTI dataset and evaluate it on the validation and test sets. We then compare the detection accuracy of CIANet with other state-of-the-art methods, as shown in Tables 1 and 2. Furthermore, we evaluate the computational complexity of CIANet and compare it with other models on the KITTI validation set, as shown in Table 3.

Table 1 Comparison of AP(%) values of different methods on the KITTI validation set at the easy, moderate, and hard levels for the car, pedestrian, and cyclist categories; all results are reported with 11 recall positions (bold values denote the best results)
Table 2 Comparison of AP(%) values of different methods on the KITTI test set at the easy, moderate, and hard levels for the car, pedestrian, and cyclist categories; all results are reported with 40 recall positions (bold values denote the best results)
Table 3 Comparison of the computational complexity of different methods on the KITTI validation set; all results are reported with 11 recall positions

Comparison with other methods. We display the detection results of our CIANet and other advanced methods on the validation set in Table 1, and it is clear that the AP values of our CIANet outperform the other methods for pedestrians and cyclists, which are hard to detect. Compared with the baseline network Voxel-RCNN, our method achieves AP improvements of 2.36%, 2.53%, and 1.65% for pedestrians at the easy, moderate, and hard levels, and AP gains of 2.89%, 1.67%, and 2.46% for cyclists at the three difficulty levels. Notably, our method also surpasses Voxel-RCNN by 0.42%, 0.73%, and 0.79% AP at the three difficulty levels for cars, respectively. Compared with the voxel-based method CT3D, which employs a Transformer structure composed of an encoder and a decoder, our method achieves distinct AP improvements of 3.72%, 2.7%, and 1.15% for pedestrians and 2.08%, 2.07%, and 2.64% for cyclists at the easy, moderate, and hard levels. In addition, the AP values for car detection are also improved by 0.57% and 0.56% at the easy and hard levels compared with CT3D. Compared with the advanced detector VoTr, which adopts the Transformer to extract enhanced features in a larger receptive field, our CIANet achieves significant AP gains for the car category at the three difficulty levels. In recent years, point-voxel combined methods have also achieved strong detection performance. Compared with the typical point-voxel-based network PV-RCNN, our CIANet has significant AP advantages for all three object categories, especially AP gains of 7.7% for pedestrians at the moderate level and 6.19% for cyclists at the hard level. Compared with the recent Octree-based Transformer network OcTr, our CIANet also gains substantial advantages: it achieves AP gains of (1.25%, 6.32%, 2.16%) for cars, (6.46%, 5.37%, 4.56%) for pedestrians, and (1.83%, 3.34%, 4.52%) for cyclists at the three difficulty levels.

We further evaluate our CIANet and recent methods on the KITTI test set and display their detection results in Table 2. Comparing the AP values, it is obvious that our CIANet outperforms the other methods for pedestrians and cyclists at the moderate and hard levels, and it also ranks first for cars at the hard level.

Compared with the baseline network Voxel-RCNN, our method obtains significant AP improvements of 3.55%, 5.72%, and 4.11% for pedestrians at the three difficulty levels, and 0.35%, 0.51%, and 0.48% AP gains for cars. Our method also achieves 0.97% and 0.45% AP boosts for cyclists at the moderate and hard levels, respectively. Compared with the attention-based detection network TANet, which specializes in detecting pedestrians, our CIANet improves the AP values for pedestrians by 0.49% at the moderate level and 1.01% at the hard level. IASSD is a recent voxel-based detection network, and our CIANet exceeds it by 3.63% for pedestrians at the moderate level and 3.47% at the hard level. Meanwhile, the AP values of our network outperform IASSD by (2.1%, 1.63%, 2.29%) for cars and (2.11%, 1.96%, 1.66%) for cyclists at the three difficulty levels. Compared with EPNet++, which fuses point cloud and image information for object detection, our CIANet still achieves significant AP advantages of (6.74%, 8.26%, 6.11%) for cyclists at the easy, moderate, and hard levels, and (0.45%, 0.21%) for pedestrians at the moderate and hard levels.

Overall, the detection results on the validation and test sets demonstrate that our method achieves superior detection performance compared with other state-of-the-art methods, especially for pedestrians and cyclists. This is because the attention mechanisms in our CIANet compute the spatial contextual correlation among different parts of objects, thus guiding the network to focus on the parts of an object that lack points to depict the boundary. That is, our CIANet can enhance the feature representation of weak objects with inadequate boundaries, which effectively boosts the detection accuracy of small-sized pedestrians and cyclists.

Analysis of computational complexity. To evaluate the computational complexity of CIANet, we compare the number of parameters (Params) and the number of floating-point operations (FLOPs) of different methods together with their mAP in Table 3. It can be seen that our CIANet consumes Params and FLOPs comparable to the baseline Voxel-RCNN, but it significantly outperforms Voxel-RCNN in terms of mean detection precision for the three object categories. Moreover, comparing the Params and FLOPs of all methods, we observe that Voxel-RCNN has the lowest FLOPs and fewer Params, i.e., the baseline Voxel-RCNN has the highest computational efficiency. Our CIANet inherits the moderate parameter count and high computational efficiency of this baseline network while achieving the highest mAP values among all methods. This demonstrates that our CIANet can effectively improve detection accuracy while maintaining low computational complexity.

4.4 3D Detection on the Waymo Dataset

We train the CIANet model on the training sequence of the Waymo dataset and evaluate its detection performance for vehicles, pedestrians, and cyclists on the validation sequence. Subsequently, we compare the detection accuracy of CIANet with some recent methods, as shown in Table 4.

Table 4 Comparison of the detection performance for different methods on the Waymo validation sequence (The bold values denote the best results)

From the detection results on the Waymo validation sequences in Table 4, it can be seen that CIANet achieves higher detection accuracy for vehicles, pedestrians, and cyclists than the other typical detection networks. Compared with the voxel-based network Part-A2, which performs convincingly on the Waymo dataset, our CIANet obtains significant mAP and mAPH gains at both difficulty levels, especially mAPH gains of 1.84% and 1.36% for pedestrians at LEVEL_1 and LEVEL_2 and an mAP improvement of 1.45% for cyclists at LEVEL_1. The mAP and mAPH values for LEVEL_1 vehicles also surpass Part-A2 by 1.58% and 1.44%, respectively. Compared with the point-voxel-based network PV-RCNN, our CIANet achieves remarkable mAP and mAPH advantages for vehicles, pedestrians, and cyclists. Concretely, the mAP values of our network outperform PV-RCNN by (1.12%, 1.44%, 2.24%) at LEVEL_1 and (0.67%, 1.31%, 1.53%) at LEVEL_2, and our mAPH values outperform PV-RCNN by (1.06%, 3.06%, 2.03%) at LEVEL_1 and (0.82%, 2.37%, 1.82%) at LEVEL_2. Moreover, our CIANet achieves substantial accuracy improvements over the recent IASSD for vehicles, pedestrians, and cyclists at both levels. In addition, as shown in the 3rd column of Table 4, the average mAPH of our CIANet over the three categories surpasses the other methods, which effectively shows that the overall detection performance of CIANet on the Waymo dataset exceeds the other detectors. This indicates that our proposed attention modules help the network perceive the sparse boundary features of objects in this large-scale dataset, thus significantly boosting detection accuracy.

4.5 Ablation Studies

We conduct ablation experiments to validate the effectiveness of the designed attention modules in CIANet. All ablation experiments are performed on the KITTI validation set with 11 recall positions, and the mAP value is adopted as the evaluation criterion.

4.5.1 Ablation studies of attention modules

In this section, we first report the detection results of the baseline Voxel-RCNN in the 1st row of Table 5 for comparison. Then we add the CSHA module to the baseline to highlight the vital information in the channel and spatial domains of the BEV features, as shown in the 2nd row, leading to mAP gains of 0.2%, 0.65%, and 0.64% for cars, pedestrians, and cyclists. This proves that the CSHA module can effectively enhance BEV features to improve detection performance. Subsequently, we further add the CSA module to extract global context information and supply it to the enhanced BEV features, as shown in the 3rd row. The 2nd and 3rd rows show that the CSA module boosts the mAP values by 0.23%, 0.8%, and 0.79% for the three object categories, which illustrates that the global contextual correlations among different parts of objects captured by the CSA module facilitate detection. Finally, we append the DIA module to capture the interactions between the channel and spatial dimensions of the RoI feature, as shown in the 4th row. From the 3rd and 4th rows, we observe that the mAP values for the three object categories are improved by 0.22%, 0.73%, and 0.91%, respectively. This proves that further enhancing the RoI features with the attention mechanism is necessary for detection.

Table 5 The results of ablation experiments for CIANet on the KITTI validation set; we report mAP(%) with 11 recall positions

Overall, successively adding the above attention modules to the baseline can gradually improve the mAP values of the three object categories, and in particular, the mAP improvements of pedestrians and cyclists are more significant. This validates the effectiveness of our method based on elaborate attention modules.

4.5.2 Ablation studies based on the CSHA module

To further explore the effectiveness of the CSHA module, we conduct ablation experiments as shown in Table 6. Here, we first remove the CSHA module from CIANet to obtain a new model named CSHA-N, whose detection accuracies for the three object categories are shown in the 1st row. Then we add the classical attention module CBAM, with serial channel and spatial attention branches, to the CSHA-N model, as shown in the 2nd row. From the 1st and 2nd rows, we observe that adding CBAM alone only slightly improves the mAP values. We then add the Channel Domain (CD) branch and the Spatial Domain (SD) branch of the CSHA module to the CSHA-N model in turn, as shown in the 3rd and 4th rows. From the 1st and 3rd rows, we find that the CD branch brings mAP improvements for cars, pedestrians, and cyclists, which verifies that modeling the interdependence between feature channels with the CD branch improves detection accuracy. From the 1st and 4th rows, it can be seen that adding the SD branch also boosts the mAP values, which certifies that the SD branch can highlight crucial spatial information of the object to facilitate detection. Lastly, we apply both the CD and SD branches to the CSHA-N baseline, as shown in the 5th row, which achieves mAP improvements of 0.21%, 0.75%, and 0.66% over CSHA-N. From the 3rd to 5th rows, it is noted that combining the two branches contributes more to detection performance than utilizing either branch alone. Furthermore, from the 2nd and 5th rows, we observe that our CSHA module with parallel channel and spatial attention branches achieves better detection performance than the traditional attention module CBAM, which demonstrates that the parallel manner of combining the channel and spatial branches is more effective than the traditional serial manner.

Table 6 The results of ablation experiments for CSHA module on KITTI validation set

4.5.3 Ablation studies based on the CSA module

The CSA module employs the position bias operation and the self-attention mechanism to extract the global contextual information of objects. To verify the effect of the position bias operation, we conduct ablation experiments as shown in Table 7.

Table 7 The results of ablation experiments for position bias operation in CSA module on KITTI validation set

Concretely, removing the position bias operation from CIANet, as shown in the 1st row, decreases the mAP values by 0.13%, 0.69%, and 0.7%, respectively. This drop verifies that updating the position information of the sampled points with the position bias operation is conducive to detection.

4.5.4 Ablation studies based on the DIA module

To explore the contribution of the internal mechanism of the DIA module to the detection performance, we conduct ablation experiments on its four branches (Br1, Br2, Br3, Br4), respectively, and the experimental results are shown in Table 8. We first remove the DIA module from CIANet to form a detection model named DIA-N, and its detection results are displayed in the 1st row.

Table 8 The results of ablation experiments for DIA module on KITTI validation set

Then we introduce the first branch (Br1) upon DIA-N to explore the interactions between the channel dimension C and spatial dimensions W, H of RoI features, as shown in the 2nd row, resulting in mAP gains of 0.06%, 0.2%, and 0.25% for cars, pedestrians, and cyclists. Afterwards, as denoted in the 3rd row, we further add the second branch (Br2) to learn the interactions between the channel dimension C and spatial dimensions L, H of RoI features. From the 2nd and 3rd rows, it can be seen that mAP values are improved by 0.07%, 0.19%, and 0.26% for the three categories of objects. Similarly, as illustrated in the 4th row, we apply the third branch (Br3) to explore the correlations between the channel dimension C and spatial dimensions L, W. And from the 3rd and 4th rows, we can see that the mAP values are boosted by 0.05%, 0.18%, and 0.25%, respectively. At last, as shown in the 5th row, we add the fourth branch (Br4) to capture the dependencies among spatial dimensions L, W, and H, which further increases the mAP values by 0.04%, 0.16%, and 0.15%, respectively. And from the 1st to 5th rows, we observe that the detection accuracy of the network is gradually increased with the sequential addition of the above four branches. This indicates that capturing the interaction information among different dimensions of RoI features can enhance the feature representation of the object, thus promoting detection accuracy.

4.6 Visualization of the Results

We visualize in Fig. 5 the detection results of the baseline Voxel-RCNN and our CIANet on the KITTI validation set. For scene a) of Fig. 5, Voxel-RCNN misses two pedestrians and one car with extremely sparse points, whereas our CIANet detects them accurately. For scene b), our CIANet detects two occluded pedestrians, one of which is ignored by Voxel-RCNN; meanwhile, CIANet precisely detects a distant cyclist with sparse points that is also missed by Voxel-RCNN. In scene c), Voxel-RCNN misses three long-distance cars and one small cyclist, while CIANet only misses one extremely distant car. Similarly, in scene d), Voxel-RCNN fails to detect two distant pedestrians, while CIANet only omits the one on the right. These intuitive visualization results further demonstrate that our method can effectively detect objects in complex scenes, especially weak small objects with sparse points, and that our elaborate attention modules indeed facilitate the improvement of detection performance. Besides, we also find that although CIANet achieves better detection results than Voxel-RCNN in scenes c) and d), it is unable to detect some extremely distant objects. In fact, the voxel-based backbone network of CIANet adopts voxel centroids to represent the voxels of objects, and for extremely distant objects with too sparse boundaries, the centers of the voxels corresponding to their boundaries tend to fall outside the ground truth boxes, which causes these boundary voxels to be misclassified as background voxels and the extremely distant objects to be missed.

Fig. 5

Qualitative results on different scenes of the KITTI dataset. The first row displays the RGB images with 3D ground truth boxes, and the second row describes the corresponding point cloud scenes with 3D ground truth boxes, where cars, pedestrians and cyclists are labeled as green, blue and yellow, respectively. The third and fourth rows display the 3D object prediction boxes of the baseline network Voxel-RCNN and our CIANet, where all the prediction boxes are labeled as red. All the foreground objects in the above point cloud scenes are colored as orange

5 Conclusion

In this paper, we present a novel two-stage 3D object detection network, CIANet, based on elaborate attention modules. In the first stage, we explore the channel interdependence and crucial spatial information of the BEV feature, followed by a 2D CNN to obtain enhanced BEV features. We then replenish the enhanced BEV features with the spatial contextual associations captured among different parts of objects; in this way, the sparse boundary parts of weak objects are highlighted to generate high-quality proposals. In the second stage, we further capture the interactions between the channel and spatial dimensions of RoI features to focus on the prominent voxel grid features, and then apply the enhanced RoI features for proposal refinement, thus generating accurate detection boxes. Extensive experiments on the KITTI and Waymo datasets validate that CIANet achieves significant improvements in detection performance over existing methods, especially for weak objects with small sizes such as pedestrians and cyclists.

In addition, this work still has a shortcoming that needs to be addressed. In the voxel-based backbone network of CIANet, we utilize voxel centroids to represent the voxels of an object, so the centroids of the boundary voxels of some objects, especially weak objects with extremely sparse boundaries, tend to fall outside the ground truth boxes and are misclassified as background voxels. Hence, in future work, we intend to add a point-wise auxiliary branch to the backbone network, which captures native point-wise features containing the complete boundary information of the object and supplements them to the voxel features, thus leading the network to perceive object boundaries more accurately and further improving detection accuracy.