Abstract
Object pose detection and tracking have recently attracted increasing attention because of their wide application in areas such as autonomous driving, robotics, and augmented reality. Among existing methods for object pose detection and tracking, deep learning-based approaches are the most promising and have shown better performance than the alternatives. However, a survey of the latest developments in deep learning-based methods is lacking. This study therefore presents a comprehensive review of recent progress in deep learning-based object pose detection and tracking. To allow a more thorough treatment, the scope is limited to methods that take monocular RGB/RGBD data as input, and it covers three major tasks: instance-level monocular object pose detection, category-level monocular object pose detection, and monocular object pose tracking. Metrics, datasets, and methods for both detection and tracking are presented in detail. Comparative results of current state-of-the-art methods on several publicly available datasets are also provided, together with observations and directions for future research.
Deep Learning on Monocular Object Pose Detection and Tracking: A Comprehensive Overview