Abstract
3D object detection has received extensive attention from researchers. RGB-D sensors are widely used to provide complementary information in 3D object detection tasks because they readily acquire aligned point cloud and RGB image data, are reasonably priced, and perform reliably. However, how to effectively fuse the point cloud and RGB image data in RGB-D images, and how to use this cross-modal information to improve 3D object detection performance, remain open challenges. To address these problems, this paper proposes an improved dense-to-sparse cross-modal fusion network for 3D object detection in RGB-D images. First, a dense-to-sparse cross-modal learning module (DCLM) is designed to reduce information waste in the interaction between dense 2D information and sparse 3D information. Then, an inter-modal attention fusion module (IAFM) is designed to adaptively retain more meaningful information when fusing the 2D and 3D features. In addition, an intra-modal attention context aggregation module (IACAM) is designed to aggregate context information within both the 2D and 3D modalities and to model the relationships between objects. Finally, detailed quantitative and qualitative experiments are carried out on the SUN RGB-D dataset, and the results show that the proposed model achieves state-of-the-art 3D object detection results.
Data availability statement
Publicly available datasets were analyzed in this study. The data can be found at https://rgbd.cs.princeton.edu/data/.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61873086) and the Science and Technology Support Program of Changzhou (CE20215022).
Ethics declarations
Conflicts of interest
The authors declare that they have no conflicts of interest related to this work.
About this article
Cite this article
Chen, Y., Ni, J., Tang, G. et al. An improved dense-to-sparse cross-modal fusion network for 3D object detection in RGB-D images. Multimed Tools Appl 83, 12159–12184 (2024). https://doi.org/10.1007/s11042-023-15845-5
DOI: https://doi.org/10.1007/s11042-023-15845-5