Abstract:
In this paper, we propose MedoidsFormer, a novel transformer-based backbone equipped with a self-attention mechanism tailored explicitly to LiDAR-based 3D object detection. Compared with 2D object detection, target objects in 3D object detection occupy a much smaller proportion of the input scene, and their distribution is significantly sparser. Given these observations, we introduce a new self-attention mechanism called Medoids Attention, which focuses on exploiting interactions within surrounding regions; this not only reduces computation and memory costs but also yields discriminative context information. Instead of aggregating tokens from adjacent areas, we present a dynamic, semantic-aware token mining process that uses k-Medoids clustering to directly select representative tokens for attention modeling. Extensive experiments show that our method consistently improves over existing 3D object detectors and achieves state-of-the-art performance on the large-scale Waymo Open Dataset. We also conduct comprehensive ablation studies to verify the efficacy of the new self-attention mechanism and provide thorough insights.
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume: 33, Issue: 10, October 2023)
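
The abstract describes selecting representative tokens through k-Medoids clustering and attending to those tokens rather than to all neighbors. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: the function names `k_medoids` and `medoid_attention`, the squared Euclidean distance, the alternating assign/update clustering scheme, the single attention head, and the absence of learned query/key/value projections are all assumptions made for illustration. The abstract does not specify how regions are formed or which distance is used, so the sketch treats one flat set of token features as a single region.

```python
import numpy as np


def k_medoids(tokens, k, n_iter=10, seed=0):
    """Return indices of k medoid tokens (simple alternating assign/update scheme).

    Hypothetical helper for illustration; distance metric is an assumption.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    # Pairwise squared Euclidean distances between token features, shape (n, n).
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist = np.einsum("ijd,ijd->ij", diff, diff)
    # Start from k distinct, randomly chosen tokens.
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every token to its nearest current medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster ends up empty
            # New medoid: the member with the smallest total distance to its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    return medoids


def medoid_attention(tokens, k, seed=0):
    """Single-head attention in which every token attends only to k medoid tokens.

    Assumption: raw features serve as queries, keys, and values (no learned weights).
    """
    n, d = tokens.shape
    idx = k_medoids(tokens, k, seed=seed)
    keys = values = tokens[idx]              # (k, d)
    scores = tokens @ keys.T / np.sqrt(d)    # (n, k) rather than (n, n)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, idx


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    feats = rng.normal(size=(64, 16))        # 64 synthetic tokens, 16 channels
    out, idx = medoid_attention(feats, k=8)
    print(out.shape, sorted(idx.tolist()))
```

In this sketch the attention score matrix has shape (n, k) instead of (n, n), which is one plausible reading of where the claimed savings in computation and memory come from. The all-pairs distance matrix is itself quadratic in the number of tokens, so the sketch is only meant for small inputs.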