DOI: 10.1145/3539618.3592069 (SIGIR '23 Conference Proceedings)
Short Paper · Open Access · Honorable Mention

Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

Published: 18 July 2023

ABSTRACT

Due to recent advances in pose-estimation methods, human motion can be extracted from common videos in the form of 3D skeleton sequences. Despite promising application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at: https://github.com/mesnico/text-to-motion-retrieval.
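To make the divided space-time attention idea concrete, the following is a minimal PyTorch sketch of one such block applied to a skeleton sequence: joints attend to each other within a frame (space), then each joint attends to itself across frames (time). It is not the authors' MoT implementation; the tensor layout, layer sizes, and class name are illustrative assumptions (the actual code is in the repository linked above).

import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """One transformer block with separate spatial (joints) and temporal (frames) attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch B, frames T, joints J, channels D)
        B, T, J, D = x.shape

        # Spatial attention: joints attend to each other within the same frame.
        s = x.reshape(B * T, J, D)
        s_norm = self.norm1(s)
        s = s + self.spatial_attn(s_norm, s_norm, s_norm)[0]
        x = s.reshape(B, T, J, D)

        # Temporal attention: each joint attends to itself across frames.
        t = x.permute(0, 2, 1, 3).reshape(B * J, T, D)
        t_norm = self.norm2(t)
        t = t + self.temporal_attn(t_norm, t_norm, t_norm)[0]
        x = t.reshape(B, J, T, D).permute(0, 2, 1, 3)

        # Position-wise feed-forward network with a residual connection.
        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    # Toy motion batch: 2 sequences, 60 frames, 21 joints, 256-d joint embeddings.
    motion = torch.randn(2, 60, 21, 256)
    block = DividedSpaceTimeBlock(dim=256, heads=4)
    print(block(motion).shape)  # torch.Size([2, 60, 21, 256])

Factorizing attention this way keeps the cost roughly linear in T·(J²) + J·(T²) rather than quadratic in T·J, which is the usual motivation for divided space-time attention over skeleton data.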


Supplemental Material

SIGIR23-srp0809-720p.mp4 (mp4, 140.3 MB)

