ABSTRACT
Thanks to recent advances in pose-estimation methods, human motion can be extracted from ordinary video in the form of 3D skeleton sequences. Despite promising application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints across space and time. Inspired by recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at: https://github.com/mesnico/text-to-motion-retrieval
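The divided space-time attention mentioned above can be illustrated with a minimal NumPy sketch: attention is first computed among the joints of each frame (spatial), then along each joint's trajectory over frames (temporal). This is only a schematic of the attention pattern, not the paper's implementation; projection matrices, multi-head splitting, and layer normalization are omitted, and all function names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the second-to-last axis
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def divided_space_time_attention(x):
    """x: (T, J, D) skeleton features -- T frames, J joints, D channels.

    Spatial step: mix the J joints within every frame independently.
    Temporal step: mix the T frames within every joint trajectory.
    Residual connections keep the input signal, as in standard
    transformer blocks.
    """
    # spatial attention: batch over frames, attend across joints
    x = x + attention(x, x, x)                      # (T, J, D)
    # temporal attention: batch over joints, attend across frames
    xt = np.swapaxes(x, 0, 1)                       # (J, T, D)
    xt = xt + attention(xt, xt, xt)
    return np.swapaxes(xt, 0, 1)                    # back to (T, J, D)
```

Compared with full joint-time attention over all T*J tokens at once, this factorized pattern costs O(T*J^2 + J*T^2) rather than O(T^2 * J^2) score computations, which is why divided attention scales better to long skeleton sequences.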
Index Terms
- Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language