
Bodyformer: Semantics-guided 3D Body Gesture Synthesis with Transformer

Published: 26 July 2023

Abstract

Automatic gesture synthesis from speech is a topic that has attracted researchers for applications in remote communication, video games, and the Metaverse. Learning the mapping between speech and 3D full-body gestures is difficult due to the stochastic nature of the problem and the lack of the rich cross-modal datasets needed for training. In this paper, we propose a novel transformer-based framework for automatic 3D body gesture synthesis from speech. To capture the stochastic nature of body gestures during speech, we propose a variational transformer that effectively models a probabilistic distribution over gestures, allowing diverse gestures to be produced during inference. Furthermore, we introduce a mode positional embedding layer to capture the different motion speeds in different speaking modes. To cope with the scarcity of data, we design an intra-modal pre-training scheme that can learn the complex mapping between speech and 3D gestures from a limited amount of data. Our system is trained with either the Trinity speech-gesture dataset or the Talking With Hands 16.2M dataset. The results show that our system can produce more realistic, appropriate, and diverse body gestures compared to existing state-of-the-art approaches.
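The abstract names two architectural components, a variational transformer that models a distribution over gestures and a mode positional embedding that accounts for different motion speeds across speaking modes, but this page carries no implementation details. The sketch below is a minimal, hypothetical PyTorch illustration of those two ideas only: the module names, feature dimensions, number of speaking modes, and the way the sampled latent is combined with the decoder are assumptions made for illustration, not the authors' code.

```python
# Hypothetical sketch (not the authors' code): a variational transformer that maps
# speech features to body poses, with a "mode positional embedding" that adds a
# learned per-speaking-mode offset to standard sinusoidal frame positions.
import math
import torch
import torch.nn as nn


class ModePositionalEmbedding(nn.Module):
    """Sinusoidal frame positions plus a learned embedding for the speaking mode."""

    def __init__(self, d_model: int, num_modes: int = 2, max_len: int = 1024):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)                      # (max_len, d_model)
        self.mode_emb = nn.Embedding(num_modes, d_model)    # one vector per mode

    def forward(self, x: torch.Tensor, mode: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, d_model); mode: (batch,) integer speaking-mode labels
        return x + self.pe[: x.size(1)] + self.mode_emb(mode).unsqueeze(1)


class VariationalGestureTransformer(nn.Module):
    """Encode speech, sample a per-frame latent gesture code, decode poses."""

    def __init__(self, speech_dim=128, pose_dim=69, d_model=256, latent_dim=64):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, d_model)
        self.pos = ModePositionalEmbedding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        self.latent_proj = nn.Linear(latent_dim, d_model)
        self.pose_head = nn.Linear(d_model, pose_dim)       # e.g. joint-rotation output

    def forward(self, speech: torch.Tensor, mode: torch.Tensor):
        h = self.encoder(self.pos(self.speech_proj(speech), mode))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        poses = self.pose_head(h + self.latent_proj(z))
        return poses, mu, logvar                            # mu/logvar feed a KL loss


speech = torch.randn(2, 120, 128)   # two clips, 120 frames of speech features each
mode = torch.tensor([0, 1])         # hypothetical integer speaking-mode labels
poses, mu, logvar = VariationalGestureTransformer()(speech, mode)
print(poses.shape)                  # torch.Size([2, 120, 69])
```

In such a conditional-VAE-style setup, training would combine a pose reconstruction loss with a KL term on (mu, logvar), and sampling different latents at inference time yields diverse gestures for the same speech input, which is the behaviour the abstract describes.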


Supplemental Material

papers_862_VOD.mp4 (presentation video, MP4, 191.9 MB)



      • Published in

        ACM Transactions on Graphics, Volume 42, Issue 4 (August 2023), 1912 pages
        ISSN: 0730-0301
        EISSN: 1557-7368
        DOI: 10.1145/3609020

        Copyright © 2023 ACM

        Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Qualifiers

        • research-article
