Skip to main content

Diffusion-Based 3D Object Detection with Random Boxes

  • Conference paper
  • First Online:
Pattern Recognition and Computer Vision (PRCV 2023)

Abstract

3D object detection is an essential task for achieving autonomous driving. Existing anchor-based detection methods rely on empirical heuristics setting of anchors, which makes the algorithms lack elegance. In recent years, we have witnessed the rise of several generative models, among which diffusion models show great potential for learning the transformation of two distributions. Our proposed Diff3Det migrates the diffusion model to proposal generation for 3D object detection by considering the detection boxes as generative targets. During training, the object boxes diffuse from the ground truth boxes to the Gaussian distribution, and the decoder learns to reverse this noise process. In the inference stage, the model progressively refines a set of random boxes to the prediction results. We provide detailed experiments on the KITTI benchmark and achieve promising performance compared to classical anchor-based 3D detection methods.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 59.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 79.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Amit, T., Nachmani, E., Shaharbany, T., Wolf, L.: Segdiff: image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390 (2021)

  2. Bai, X., et al.: Transfusion: robust lidar-camera fusion for 3d object detection with transformers. In: CVPR (2022)

    Google Scholar 

  3. Brempong, E.A., Kornblith, S., Chen, T., Parmar, N., Minderer, M., Norouzi, M.: Denoising pretraining for semantic segmentation. In: CVPR (2022)

    Google Scholar 

  4. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13

  5. Chen, S., Sun, P., Song, Y., Luo, P.: Diffusiondet: diffusion model for object detection. arXiv preprint arXiv:2211.09788 (2022)

  6. Chen, T., Li, L., Saxena, S., Hinton, G., Fleet, D.J.: A generalist framework for panoptic segmentation of images and videos. arXiv preprint arXiv:2210.06366 (2022)

  7. Chen, T., Zhang, R., Hinton, G.: Analog bits: generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202 (2022)

  8. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network for autonomous driving. In: CVPR (2017)

    Google Scholar 

  9. Chen, Y., Liu, J., Zhang, X., Qi, X., Jia, J.: Voxelnext: fully sparse voxelnet for 3d object detection and tracking. In: CVPR (2023)

    Google Scholar 

  10. Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H.: Voxel r-cnn: towards high performance voxel-based 3d object detection. In: AAAI (2021)

    Google Scholar 

  11. Duan, Y., Guo, X., Zhu, Z.: Diffusiondepth: diffusion denoising approach for monocular depth estimation. arXiv preprint arXiv:2303.05021 (2023)

  12. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: CVPR (2012)

    Google Scholar 

  13. Gu, S., et al.: Vector quantized diffusion model for text-to-image synthesis. In: CVPR (2022)

    Google Scholar 

  14. Han, J., Wan, Z., Liu, Z., Feng, J., Zhou, B.: Sparsedet: towards end-to-end 3d object detection. In: VISAPP (2022)

    Google Scholar 

  15. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)

    Google Scholar 

  16. Huang, T., Liu, Z., Chen, X., Bai, X.: EPNet: enhancing point features with image semantics for 3D object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 35–52. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_3

  17. Ji, Y., et al.: DDP: diffusion model for dense visual prediction. arXiv preprint arXiv:2303.17559 (2023)

  18. Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.L.: Joint 3d proposal generation and object detection from view aggregation. In: IROS (2018)

    Google Scholar 

  19. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: fast encoders for object detection from point clouds. In: CVPR (2019)

    Google Scholar 

  20. Li, J., Liu, Z., Hou, J., Liang, D.: Dds3d: dense pseudo-labels with dynamic threshold for semi-supervised 3d object detection. In: ICRA (2023)

    Google Scholar 

  21. Li, X., et al.: Logonet: towards accurate 3d object detection with local-to-global cross-modal fusion. In: CVPR (2023)

    Google Scholar 

  22. Li, Y., et al.: Bevdepth: acquisition of reliable depth for multi-view 3d object detection. In: AAAI (2023)

    Google Scholar 

  23. Liang, M., Yang, B., Wang, S., Urtasun, R.: Deep continuous fusion for multi-sensor 3D object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 663–678. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_39

  24. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)

    Google Scholar 

  25. Liu, N., Li, S., Du, Y., Torralba, A., Tenenbaum, J.B.: Compositional visual generation with composable diffusion models. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13677, pp. 423–439. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19790-1_26

  26. Liu, Y., Wang, T., Zhang, X., Sun, J.: PETR: position embedding transformation for multi-view 3D object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13687, pp. 531–548. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19812-0_31

  27. Liu, Z., Huang, T., Li, B., Chen, X., Wang, X., Bai, X.: Epnet++: cascade bi-directional fusion for multi-modal 3d object detection. IEEE Trans. Pattern Anal. Mach. Intell. (2022)

    Google Scholar 

  28. Liu, Z., Zhao, X., Huang, T., Hu, R., Zhou, Y., Bai, X.: Tanet: robust 3d object detection from point clouds with triple attention. In: AAAI (2020)

    Google Scholar 

  29. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  30. Noh, J., Lee, S., Ham, B.: Hvpr: hybrid voxel-point representation for single-stage 3d object detection. In: CVPR (2021)

    Google Scholar 

  31. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum pointnets for 3d object detection from rgb-d data. In: CVPR (2018)

    Google Scholar 

  32. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: NeurIPS (2017)

    Google Scholar 

  33. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

    Google Scholar 

  34. Shi, G., Li, R., Ma, C.: PillarNet: real-time and high-performance pillar-based 3D object detection. In: Avidan, S., et al. (eds.) ECCV 2022. LNCS, vol. 13670, pp. 35–52. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20080-9_3

  35. Shi, S., et al.: Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In: CVPR (2020)

    Google Scholar 

  36. Shi, S., Wang, X., Li, H.: Pointrcnn: 3d object proposal generation and detection from point cloud. In: CVPR (2019)

    Google Scholar 

  37. Shi, S., Wang, Z., Shi, J., Wang, X., Li, H.: From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. (2020)

    Google Scholar 

  38. Simonelli, A., Bulo, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling monocular 3d object detection. In: ICCV (2019)

    Google Scholar 

  39. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)

    Google Scholar 

  40. Sun, P., et al.: Sparse r-cnn: end-to-end object detection with learnable proposals. In: CVPR (2021)

    Google Scholar 

  41. Xiong, K., et al.: Cape: camera view position embedding for multi-view 3d object detection. In: CVPR (2023)

    Google Scholar 

  42. Yan, Y., Mao, Y., Li, B.: Second: sparsely embedded convolutional detection. Sensors (2018)

    Google Scholar 

  43. Yang, Z., Sun, Y., Liu, S., Jia, J.: 3dssd: point-based 3d single stage object detector. In: CVPR (2020)

    Google Scholar 

  44. Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3d object detection and tracking. In: CVPR (2021)

    Google Scholar 

  45. Zhang, D., et al.: Sam3d: zero-shot 3d object detection via segment anything model. arXiv preprint arXiv:2306.02245 (2023)

  46. Zhang, D., et al.: A simple vision transformer for weakly semi-supervised 3d object detection. In: ICCV (2023)

    Google Scholar 

  47. Zhang, Y., Hu, Q., Xu, G., Ma, Y., Wan, J., Guo, Y.: Not all points are equal: learning highly efficient point-based detectors for 3d lidar point clouds. In: CVPR (2022)

    Google Scholar 

  48. Zhou, D., et al.: Iou loss for 2d/3d object detection. In: 3DV (2019)

    Google Scholar 

  49. Zhou, Y., Tuzel, O.: Voxelnet: end-to-end learning for point cloud based 3d object detection. In: CVPR (2018)

    Google Scholar 

Download references

Acknowledgement

This work was supported by the National Science Fund for Distinguished Young Scholars of China (Grant No. 62225603) and the National Undergraduate Training Projects for Innovation and Entrepreneurship (202310487020).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xiang Bai .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Zhou, X. et al. (2024). Diffusion-Based 3D Object Detection with Random Boxes. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14426. Springer, Singapore. https://doi.org/10.1007/978-981-99-8432-9_3

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-8432-9_3

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8431-2

  • Online ISBN: 978-981-99-8432-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics