Abstract
3D object detection is an essential task for autonomous driving. Existing anchor-based detectors rely on empirical, hand-tuned anchor settings, which makes these algorithms inelegant. In recent years, generative models have risen to prominence; among them, diffusion models show great potential for learning the transformation between two distributions. Our proposed Diff3Det migrates the diffusion model to proposal generation for 3D object detection by treating detection boxes as generative targets. During training, object boxes diffuse from the ground-truth boxes toward a Gaussian distribution, and the decoder learns to reverse this noising process. At inference, the model progressively refines a set of random boxes into the final predictions. We provide detailed experiments on the KITTI benchmark and achieve promising performance compared with classical anchor-based 3D detection methods.
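The forward process described in the abstract (ground-truth boxes diffusing toward a Gaussian) can be sketched as below. This is an illustrative reconstruction, not the paper's released code: the function names (`make_alpha_bars`, `diffuse_boxes`), the cosine schedule, the 7-parameter box encoding, and the signal-scaling factor are assumptions, loosely following the DiffusionDet-style recipe the paper builds on.

```python
import numpy as np

def make_alpha_bars(num_steps=1000):
    """Cosine noise schedule; returns the cumulative product alpha_bar_t."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    f = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    alpha_bars = f[1:] / f[0]
    return np.clip(alpha_bars, 1e-5, 1.0)

def diffuse_boxes(gt_boxes, t, alpha_bars, scale=2.0, rng=None):
    """Forward diffusion: corrupt ground-truth boxes toward a Gaussian.

    gt_boxes:   (N, 7) array of normalized 3D boxes (x, y, z, l, w, h, yaw),
                assumed to lie roughly in [-1, 1].
    t:          integer timestep index in [0, num_steps).
    alpha_bars: schedule from make_alpha_bars().
    scale:      signal-scaling factor so boxes stay distinguishable from noise.

    Returns x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    """
    rng = rng or np.random.default_rng()
    x0 = gt_boxes * scale
    eps = rng.standard_normal(x0.shape)          # eps ~ N(0, I)
    abar = alpha_bars[t]
    xt = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return np.clip(xt, -scale, scale)            # keep boxes in the scaled range
```

At small `t` the output stays close to the (scaled) ground-truth boxes; at `t` near the final step it is almost pure clipped Gaussian noise, which is exactly the distribution the decoder starts from at inference.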
Acknowledgement
This work was supported by the National Science Fund for Distinguished Young Scholars of China (Grant No. 62225603) and the National Undergraduate Training Projects for Innovation and Entrepreneurship (202310487020).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Zhou, X. et al. (2024). Diffusion-Based 3D Object Detection with Random Boxes. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14426. Springer, Singapore. https://doi.org/10.1007/978-981-99-8432-9_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8431-2
Online ISBN: 978-981-99-8432-9
eBook Packages: Computer Science, Computer Science (R0)