
Deep Learning on Monocular Object Pose Detection and Tracking: A Comprehensive Overview

Published: 21 November 2022

Abstract

Object pose detection and tracking has recently attracted increasing attention because of its wide range of applications in areas such as autonomous driving, robotics, and augmented reality. Among the methods for object pose detection and tracking, deep learning is the most promising and has shown better performance than other approaches. However, a survey of the latest developments in deep learning-based methods is lacking. This study therefore presents a comprehensive review of recent progress in object pose detection and tracking along the deep learning technical route. To allow a more thorough treatment, the scope of this study is limited to methods that take monocular RGB/RGBD data as input, and it covers three major tasks: instance-level monocular object pose detection, category-level monocular object pose detection, and monocular object pose tracking. Metrics, datasets, and methods for both detection and tracking are presented in detail. Comparative results of current state-of-the-art methods on several publicly available datasets are also presented, together with observations and directions for future research.

  120. [120] Qian Rui, Garg Divyansh, Wang Yan, You Yurong, Belongie Serge, Hariharan Bharath, Campbell Mark, Weinberger Kilian Q., and Chao Wei-Lun. 2020. End-to-end pseudo-LiDAR for image-based 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 58815890.Google ScholarGoogle ScholarCross RefCross Ref
  121. [121] Qin Zengyi, Wang Jinglu, and Lu Yan. 2019. MonoGRNet: A geometric reasoning network for monocular 3D object localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 88518858.Google ScholarGoogle ScholarDigital LibraryDigital Library
  122. [122] Rad Mahdi and Lepetit Vincent. 2017. BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision. 38283836.Google ScholarGoogle ScholarCross RefCross Ref
  123. [123] Reading Cody, Harakeh Ali, Chae Julia, and Waslander Steven L.. 2021. Categorical depth distribution network for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 85558564.Google ScholarGoogle ScholarCross RefCross Ref
  124. [124] Redmon Joseph, Divvala Santosh, Girshick Ross, and Farhadi Ali. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779788.Google ScholarGoogle ScholarCross RefCross Ref
  125. [125] Ren Shaoqing, He Kaiming, Girshick Ross, and Sun Jian. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 6 (2016), 11371149.Google ScholarGoogle ScholarDigital LibraryDigital Library
  126. [126] Sahin Caner, Garcia-Hernando Guillermo, Sock Juil, and Kim Tae-Kyun. 2019. Instance-and category-level 6D object pose estimation. In RGB-D Image Analysis and Processing. Springer, 243265.Google ScholarGoogle ScholarCross RefCross Ref
  127. [127] Sahin Caner, Garcia-Hernando Guillermo, Sock Juil, and Kim Tae-Kyun. 2020. A review on object pose recovery: From 3D bounding box detectors to full 6D pose estimators. Image Vis. Comput. 96 (2020), 103898.Google ScholarGoogle ScholarCross RefCross Ref
  128. [128] Sahin Caner and Kim Tae-Kyun. 2018. Category-level 6D object pose recovery in depth images. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.Google ScholarGoogle Scholar
  129. [129] Sahin Caner and Kim Tae-Kyun. 2018. Recovering 6D object pose: A review and multi-modal analysis. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.Google ScholarGoogle Scholar
  130. [130] Sandler Mark, Howard Andrew, Zhu Menglong, Zhmoginov Andrey, and Chen Liang-Chieh. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 45104520.Google ScholarGoogle ScholarCross RefCross Ref
  131. [131] Shi Xuepeng, Chen Zhixiang, and Kim Tae-Kyun. 2020. Distance-normalized unified representation for monocular 3D object detection. In Proceedings of the European Conference on Computer Vision. Springer, 91107.Google ScholarGoogle ScholarDigital LibraryDigital Library
  132. [132] Shotton Jamie, Glocker Ben, Zach Christopher, Izadi Shahram, Criminisi Antonio, and Fitzgibbon Andrew. 2013. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 29302937.Google ScholarGoogle ScholarDigital LibraryDigital Library
  133. [133] Siam Mennatullah, Valipour Sepehr, Jagersand Martin, and Ray Nilanjan. 2017. Convolutional gated recurrent networks for video segmentation. In Proceedings of the IEEE International Conference on Image Processing (ICIP). IEEE, 30903094.Google ScholarGoogle ScholarDigital LibraryDigital Library
  134. [134] Simonelli Andrea, Bulò Samuel Rota, Porzi Lorenzo, Kontschieder Peter, and Ricci Elisa. 2020. Demystifying Pseudo-LiDAR for monocular 3D object detection. arXiv preprint arXiv:2012.05796 (2020).Google ScholarGoogle Scholar
  135. [135] Simonelli Andrea, Bulo Samuel Rota, Porzi Lorenzo, López-Antequera Manuel, and Kontschieder Peter. 2019. Disentangling monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19911999.Google ScholarGoogle ScholarCross RefCross Ref
  136. [136] Sock Juil, Garcia-Hernando Guillermo, Armagan Anil, and Kim Tae-Kyun. 2020. Introducing pose consistency and warp-alignment for self-supervised 6D object pose estimation in color images. In Proceedings of the International Conference on 3D Vision (3DV). IEEE, 291300.Google ScholarGoogle ScholarCross RefCross Ref
  137. [137] Song Chen, Song Jiaru, and Huang Qixing. 2020. HybridPose: 6D object pose estimation under hybrid representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 431440.Google ScholarGoogle ScholarCross RefCross Ref
  138. [138] Steger Carsten. 2001. Similarity measures for occlusion, clutter, and illumination invariant object recognition. In Proceedings of the Joint Pattern Recognition Symposium. Springer, 148154.Google ScholarGoogle ScholarCross RefCross Ref
  139. [139] Sundermeyer Martin, Marton Zoltan-Csaba, Durner Maximilian, Brucker Manuel, and Triebel Rudolph. 2018. Implicit 3D orientation learning for 6D object detection from RGB images. In Proceedings of the European Conference on Computer Vision (ECCV). 699715.Google ScholarGoogle ScholarDigital LibraryDigital Library
  140. [140] Tan Mingxing and Le Quoc. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 61056114.Google ScholarGoogle Scholar
  141. [141] Tan Mingxing, Pang Ruoming, and Le Quoc V.. 2020. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1078110790.Google ScholarGoogle ScholarCross RefCross Ref
  142. [142] Tekin Bugra, Sinha Sudipta N., and Fua Pascal. 2018. Real-time seamless single shot 6D object pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 292301.Google ScholarGoogle ScholarCross RefCross Ref
  143. [143] Tian Meng, Ang Marcelo H., and Lee Gim Hee. 2020. Shape prior deformation for categorical 6D object pose and size estimation. In Proceedings of the European Conference on Computer Vision. Springer, 530546.Google ScholarGoogle ScholarDigital LibraryDigital Library
  144. [144] Tobin Josh, Fong Rachel, Ray Alex, Schneider Jonas, Zaremba Wojciech, and Abbeel Pieter. 2017. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2330.Google ScholarGoogle ScholarDigital LibraryDigital Library
  145. [145] Trabelsi Ameni, Chaabane Mohamed, Blanchard Nathaniel, and Beveridge Ross. 2021. A pose proposal and refinement network for better 6D object pose estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 23822391.Google ScholarGoogle ScholarCross RefCross Ref
  146. [146] Tremblay Jonathan, To Thang, Sundaralingam Balakumar, Xiang Yu, Fox Dieter, and Birchfield Stan. 2018. Deep object pose estimation for semantic robotic grasping of household objects. In Proceedings of the Conference on Robot Learning (CoRL).Google ScholarGoogle Scholar
  147. [147] Umeyama Shinji. 1991. Least-squares estimation of transformation parameters between two point patterns. IEEE Comput. Archit. Lett. 13, 04 (1991), 376380.Google ScholarGoogle Scholar
  148. [148] Wada Kentaro, Sucar Edgar, James Stephen, Lenton Daniel, and Davison Andrew J.. 2020. MoreFusion: Multi-object reasoning for 6D pose estimation from volumetric fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1454014549.Google ScholarGoogle ScholarCross RefCross Ref
  149. [149] Wang Chen, Martín-Martín Roberto, Xu Danfei, Lv Jun, Lu Cewu, Fei-Fei Li, Savarese Silvio, and Zhu Yuke. 2020. 6-PACK: Category-level 6D pose tracker with anchor-based keypoints. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1005910066.Google ScholarGoogle ScholarCross RefCross Ref
  150. [150] Wang Chen, Xu Danfei, Zhu Yuke, Martín-Martín Roberto, Lu Cewu, Fei-Fei Li, and Savarese Silvio. 2019. Densefusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 33433352.Google ScholarGoogle ScholarCross RefCross Ref
  151. [151] Wang Gu, Manhardt Fabian, Shao Jianzhun, Ji Xiangyang, Navab Nassir, and Tombari Federico. 2020. Self6D: Self-supervised monocular 6D object pose estimation. In Proceedings of the European Conference on Computer Vision. Springer, 108125.Google ScholarGoogle ScholarDigital LibraryDigital Library
  152. [152] Wang Gu, Manhardt Fabian, Tombari Federico, and Ji Xiangyang. 2021. GDR-Net: Geometry-guided direct regression network for monocular 6D object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1661116621.Google ScholarGoogle ScholarCross RefCross Ref
  153. [153] Wang He, Sridhar Srinath, Huang Jingwei, Valentin Julien, Song Shuran, and Guibas Leonidas J.. 2019. Normalized object coordinate space for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 26422651.Google ScholarGoogle ScholarCross RefCross Ref
  154. [154] Wang Jiadai, Liu Jiajia, and Kato Nei. 2018. Networking and communications in autonomous driving: A survey. IEEE Commun. Surv. Tutor. 21, 2 (2018), 12431274.Google ScholarGoogle ScholarCross RefCross Ref
  155. [155] Wang Tai, Zhu Xinge, Pang Jiangmiao, and Lin Dahua. 2021. FCOS3D: Fully convolutional one-stage monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops.Google ScholarGoogle ScholarCross RefCross Ref
  156. [156] Wang Tai, Zhu Xinge, Pang Jiangmiao, and Lin Dahua. 2021. Probabilistic and geometric depth: Detecting objects in perspective. arXiv preprint arXiv:2107.14160 (2021).Google ScholarGoogle Scholar
  157. [157] Wang Wenhai, Xie Enze, Li Xiang, Fan Deng-Ping, Song Kaitao, Liang Ding, Lu Tong, Luo Ping, and Shao Ling. 2021. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the International Conference on Computer Vision (ICCV).Google ScholarGoogle ScholarCross RefCross Ref
  158. [158] Wang Yan, Chao Wei-Lun, Garg Divyansh, Hariharan Bharath, Campbell Mark, and Weinberger Kilian Q.. 2019. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 84458453.Google ScholarGoogle ScholarCross RefCross Ref
  159. [159] Wei Jianing, Ye Genzhi, Mullen Tyler, Grundmann Matthias, Ahmadyan Adel, and Hou Tingbo. 2019. Instant motion tracking and its applications to augmented reality. arXiv preprint arXiv:1907.06796 (2019).Google ScholarGoogle Scholar
  160. [160] Wen Bowen, Mitash Chaitanya, Ren Baozhang, and Bekris Kostas E.. 2020. SE (3)-tracknet: Data-driven 6D pose tracking by calibrating image residuals in synthetic domains. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 1036710373.Google ScholarGoogle ScholarDigital LibraryDigital Library
  161. [161] Weng Xinshuo and Kitani Kris. 2019. Monocular 3D object detection with pseudo-LiDAR point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.Google ScholarGoogle ScholarCross RefCross Ref
  162. [162] Weng Xinshuo, Wang Jianren, Held David, and Kitani Kris. 2020. 3D multi-object tracking: A baseline and new evaluation metrics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 1035910366.Google ScholarGoogle ScholarDigital LibraryDigital Library
  163. [163] Weng Xinshuo, Yuan Ye, and Kitani Kris. 2020. Joint 3D tracking and forecasting with graph neural network and diversity sampling. arXiv preprint arXiv:2003.07847 (2020).Google ScholarGoogle Scholar
  164. [164] Wohlhart Paul and Lepetit Vincent. 2015. Learning descriptors for object recognition and 3D pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 31093118.Google ScholarGoogle ScholarCross RefCross Ref
  165. [165] Wu Yangzheng, Zand Mohsen, Etemad Ali, and Greenspan Michael. 2021. Vote from the center: 6 DoF pose estimation in RGB-D images by radial keypoint voting. arXiv preprint arXiv:2104.02527 (2021).Google ScholarGoogle Scholar
  166. [166] Xiang Yu, Schmidt Tanner, Narayanan Venkatraman, and Fox Dieter. 2017. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017).Google ScholarGoogle Scholar
  167. [167] Shi Xingjian, Chen Zhourong, Wang Hao, Yeung Dit-Yan, Wong Wai-Kin, and Woo Wang-chun. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 802810.Google ScholarGoogle Scholar
  168. [168] Yang Zongxin, Yu Xin, and Yang Yi. 2021. DSC-PoseNet: Learning 6DoF object pose estimation via dual-scale consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 39073916.Google ScholarGoogle ScholarCross RefCross Ref
  169. [169] Ye Xiaoqing, Du Liang, Shi Yifeng, Li Yingying, Tan Xiao, Feng Jianfeng, Ding Errui, and Wen Shilei. 2020. Monocular 3D object detection via feature domain adaptation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. Springer, 1734.Google ScholarGoogle ScholarDigital LibraryDigital Library
  170. [170] Yen-Chen Lin, Florence Pete, Barron Jonathan T., Rodriguez Alberto, Isola Phillip, and Lin Tsung-Yi. 2020. iNeRF: Inverting neural radiance fields for pose estimation. arXiv preprint arXiv:2012.05877 (2020).Google ScholarGoogle Scholar
  171. [171] You Yurong, Wang Yan, Chao Wei-Lun, Garg Divyansh, Pleiss Geoff, Hariharan Bharath, Campbell Mark, and Weinberger Kilian Q.. 2020. Pseudo-LiDAR++: Accurate depth for 3D object detection in autonomous driving. Proceedings of the Conference on International Conference on Learning Representations (ICLR).Google ScholarGoogle Scholar
  172. [172] Yu Xin, Zhuang Zheyu, Koniusz Piotr, and Li Hongdong. 2020. 6DoF object pose estimation via differentiable proxy voting loss. In Proceedings of the British Machine Vision Conference (BMVC).Google ScholarGoogle Scholar
  173. [173] Zakharov Sergey, Kehl Wadim, Planche Benjamin, Hutter Andreas, and Ilic Slobodan. 2017. 3D object instance recognition and pose estimation using triplet loss with dynamic margin. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 552559.Google ScholarGoogle ScholarDigital LibraryDigital Library
  174. [174] Zakharov Sergey, Shugurov Ivan, and Ilic Slobodan. 2019. DPOD: 6D pose object detector and refiner. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19411950.Google ScholarGoogle ScholarCross RefCross Ref
  175. [175] Zhang Xiangyu, Zhou Xinyu, Lin Mengxiao, and Sun Jian. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 68486856.Google ScholarGoogle ScholarCross RefCross Ref
  176. [176] Zhao Zelin, Peng Gao, Wang Haoyu, Fang Hao-Shu, Li Chengkun, and Lu Cewu. 2018. Estimating 6D pose from localizing designated surface keypoints. arXiv preprint arXiv:1812.01387 (2018).Google ScholarGoogle Scholar
  177. [177] Zhong Leisheng, Zhang Yu, Zhao Hao, Chang An, Xiang Wenhao, Zhang Shunli, and Zhang Li. 2020. Seeing through the occluders: Robust monocular 6-DoF object pose tracking via model-guided video object segmentation. IEEE Robot. Automat. Lett. 5, 4 (2020), 51595166.Google ScholarGoogle ScholarCross RefCross Ref
  178. [178] Zhou Xingyi, Koltun Vladlen, and Krähenbühl Philipp. 2020. Tracking objects as points. In Proceedings of the European Conference on Computer Vision. Springer, 474490.Google ScholarGoogle ScholarDigital LibraryDigital Library
  179. [179] Zhou Xichuan, Peng Yicong, Long Chunqiao, Ren Fengbo, and Shi Cong. 2020. MoNet3D: Towards accurate monocular 3D object localization in real time. In Proceedings of the International Conference on Machine Learning. PMLR, 1150311512.Google ScholarGoogle Scholar
  180. [180] Zhu Chenchen, He Yihui, and Savvides Marios. 2019. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 840849.Google ScholarGoogle ScholarCross RefCross Ref



      • Published in

        ACM Computing Surveys  Volume 55, Issue 4
        April 2023
        871 pages
        ISSN: 0360-0300
        EISSN: 1557-7341
        DOI: 10.1145/3567469

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 21 November 2022
        • Online AM: 31 March 2022
        • Accepted: 8 March 2022
        • Revised: 17 January 2022
        • Received: 8 June 2021


        Qualifiers

        • survey
        • Refereed
