EgoLocate: Real-time Motion Capture, Localization, and Mapping with Sparse Body-mounted Sensors

Abstract
Human and environment sensing are two important topics in Computer Vision and Graphics. Human motion is often captured by inertial sensors, while the environment is mostly reconstructed using cameras. We integrate the two techniques in EgoLocate, a system that simultaneously performs human motion capture (mocap), localization, and mapping in real time from sparse body-mounted sensors: six inertial measurement units (IMUs) and a monocular phone camera. On one hand, inertial mocap suffers from large translation drift due to the lack of a global positioning signal. EgoLocate leverages image-based simultaneous localization and mapping (SLAM) techniques to locate the human in the reconstructed scene. On the other hand, SLAM often fails when visual features are poor. EgoLocate incorporates inertial mocap to provide a strong prior for the camera motion. Experiments show that localization, a key challenge in both fields, is largely improved by our technique compared with the state of the art of each field. Our code is available for research at https://xinyu-yi.github.io/EgoLocate/.
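The complementary idea in the abstract — inertial integration drifts over time, while sparse visual localization supplies absolute position fixes — can be illustrated with a toy 1-D sketch. This is not the EgoLocate algorithm; the function, its parameters, and the blending scheme are hypothetical, chosen only to make the drift-correction intuition concrete.

```python
def fuse_translation(imu_positions, cam_fixes, cam_weight=1.0):
    """Toy drift correction: pull IMU-integrated positions toward sparse
    absolute camera fixes.

    imu_positions: per-frame positions from inertial integration (drift grows).
    cam_fixes: dict {frame_index: absolute position} from visual localization.
    cam_weight: how strongly a camera fix pulls the estimate (0..1).
    """
    fused = []
    offset = 0.0  # running correction accumulated from past camera fixes
    for i, p in enumerate(imu_positions):
        est = p + offset
        if i in cam_fixes:
            # Blend toward the absolute fix and remember the correction,
            # so frames between fixes also benefit from it.
            corrected = (1 - cam_weight) * est + cam_weight * cam_fixes[i]
            offset += corrected - est
            est = corrected
        fused.append(est)
    return fused

# IMU estimate drifts by +0.1 per frame away from the true positions 0..5.
true_pos = [float(i) for i in range(6)]
imu = [t + 0.1 * i for i, t in enumerate(true_pos)]
fused = fuse_translation(imu, {3: 3.0, 5: 5.0})
```

Between fixes (frame 4 above) the estimate still carries the last correction, so it stays closer to the truth than raw inertial integration — a loose analogy to how camera localization bounds mocap translation drift in the paper.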