STSD: spatial–temporal semantic decomposition transformer for skeleton-based action recognition

Cui, Hu; Hayama, Tessai

doi:10.1007/s00530-023-01251-2

STSD: spatial–temporal semantic decomposition transformer for skeleton-based action recognition

Regular Paper
Published: 26 January 2024

Volume 30, article number 43, (2024)
Cite this article

Multimedia Systems Aims and scope Submit manuscript

Hu Cui¹ &
Tessai Hayama¹

152 Accesses
Explore all metrics

Abstract

Skeleton-based human action recognition has attracted widespread interest, as skeleton data are extremely robust to changes in lighting, camera views, and complex backgrounds. In recent studies, transformer-based methods are proposed for the encoding of the latent information underlying the 3D skeleton. These methods focus on modeling the relationships of joints in skeleton sequences without any predefined graphical information by self-attention mechanism and have been proven to be effective. But there are two challenging issues ignored in these methods: the utilization of human body-related and dynamic semantic information. In this work, we propose a novel spatial–temporal semantic decomposition transformer network (STSD-TR) that models dependencies between joints with body parts semantics and sub-action semantics. In our STSD-TR, a body parts semantic decomposition module (BPSD) is used to extract body parts semantic information from 3D coordinates of joints, and then a temporal-local spatial–temporal attention module (TL-STA) is used to capture the relationships of joints in several consecutive frames which can be understood as local sub-action semantic information. Finally, a global spatial–temporal module (GST) is used to aggregate the temporal-local features and generate a global spatial–temporal representation. Moreover, we design a BodyParts-Mix strategy which mixes body parts from two people in a unique manner and further boosts the performance. Compared with the state-of-the-art methods, our method achieves competitive performance on two large-scale datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Visual attention network

Article Open access 28 July 2023

Human Action Recognition and Prediction: A Survey

Article 28 March 2022

Human action recognition using fusion of multiview and deep features: an application to video surveillance

Article 14 March 2020

Availability of data and materials

All data generated or analyzed during this study are included in this published article.

References

Lee, J., Ahn, B.: Real-time human action recognition with a low-cost rgb camera and mobile robot platform. Sensors 20(10), 2886 (2020)
Article ADS PubMed PubMed Central Google Scholar
Sreenu, G., Durai, S.: Intelligent video surveillance: a review through deep learning techniques for crowd analysis. J Big Data 6(1), 1–27 (2019)
Article Google Scholar
Caraban, A., Karapanos, E., Gonçalves, D., Campos, P.: 23 ways to nudge: A review of technology-mediated nudging in human-computer interaction. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–15 (2019)
Cui, H., Chang, C.: Deep learning based advanced spatio-temporal extraction model in medical sports rehabilitation for motion analysis and data processing. IEEE Access 8, 115848–115856 (2020)
Article Google Scholar
Tran, A., Cheong, L.-F.: Two-stream flow-guided convolutional attention networks for action recognition. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 3110–3119 (2017)
Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2012)
Article Google Scholar
Ullah, A., Ahmad, J., Muhammad, K., Sajjad, M., Baik, S.W.: Action recognition in video sequences using deep bi-directional lstm with cnn features. IEEE Access 6, 1155–1166 (2017)
Article Google Scholar
Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)
Sun, B., Ye, X., Yan, T., Wang, Z., Li, H., Wang, Z.: Fine-grained action recognition with robust motion representation decoupling and concentration. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4779–4788 (2022)
Sun, B., Ye, X., Wang, Z., Li, H., Wang, Z.: Exploring coarse-to-fine action token localization and interaction for fine-grained video action recognition. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5070–5078 (2023)
Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI Conference on Artificial Intelligence (2018)
Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., Lu, H.: Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 183–192 (2020)
Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12026–12035 (2019)
Plizzari, C., Cannici, M., Matteucci, M.: Spatial temporal transformer network for skeleton-based action recognition. In: International Conference on Pattern Recognition, pp. 694–701 (2021). Springer
Xu, K., Ye, F., Zhong, Q., Xie, D.: Topology-aware convolutional neural network for efficient skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 2866–2874 (2022)
Shi, L., Zhang, Y., Cheng, J., Lu, H.: Decoupled spatial-temporal attention network for skeleton-based action recognition. arXiv preprint arXiv:2007.03263 (2020)
Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2117–2126 (2017)
Si, C., Chen, W., Wang, W., Wang, L., Tan, T.: An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1227–1236 (2019)
Ke, Q., Bennamoun, M., An, S., Sohel, F., Boussaid, F.: Learning clip representations for skeleton-based 3d action recognition. IEEE Trans. Image Process. 27(6), 2842–2855 (2018)
Article ADS MathSciNet PubMed Google Scholar
Liu, J., Wang, G., Duan, L.-Y., Abdiyeva, K., Kot, A.C.: Skeleton-based human action recognition with global context-aware attention lstm networks. IEEE Trans. Image Process. 27(4), 1586–1599 (2017)
Article ADS MathSciNet Google Scholar
Ye, F., Pu, S., Zhong, Q., Li, C., Xie, D., Tang, H.: Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 55–63 (2020)
Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 143–152 (2020)
Chen, Z., Li, S., Yang, B., Li, Q., Liu, H.: Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1113–1122 (2021)
Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)
Cho, S., Maqbool, M., Liu, F., Foroosh, H.: Self-attention network for skeleton-based human action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 635–644 (2020)
Lv, J., Gong, X.: Multi-grained temporal segmentation attention modeling for skeleton-based action recognition. IEEE Sig. Process. Lett. 30, 927–931 (2023). https://doi.org/10.1109/LSP.2023.3298286
Article ADS Google Scholar
Qiu, H., Hou, B., Ren, B., Zhang, X.: Spatio-temporal segments attention for skeleton-based action recognition. Neurocomputing 518, 30–38 (2023)
Article Google Scholar
Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: Regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032 (2019)
Shahroudy, A., Liu, J., Ng, T.-T., Wang, G.: Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019 (2016)
Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.-Y., Kot, A.C.: Ntu rgb+ d 120: a large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42(10), 2684–2701 (2019)
Article PubMed Google Scholar
Fernando, B., Gavves, E., Oramas, J.M., Ghodrati, A., Tuytelaars, T.: Modeling video evolution for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5378–5387 (2015)
Vemulapalli, R., Arrate, F., Chellappa, R.: Human action recognition by representing 3d skeletons as points in a lie group. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595 (2014)
Garcia-Hernando, G., Kim, T.-K.: Transition forests: Learning discriminative temporal transitions for action recognition and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 432–440 (2017)
Han, F., Reily, B., Hoff, W., Zhang, H.: Space-time representation of people based on 3d skeletal data: A review. Comput. Vis. Image Underst. 158, 85–105 (2017)
Article Google Scholar
Yu, G., Liu, Z., Yuan, J.: Discriminative orderlet mining for real-time recognition of human-object interaction. In: Asian Conference on Computer Vision, pp. 50–65 (2015). Springer
Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-temporal lstm with trust gates for 3d human action recognition. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pp. 816–833 (2016). Springer
Cheng, K., Zhang, Y., He, X., Cheng, J., Lu, H.: Extremely lightweight skeleton-based action recognition with shiftgcn++. IEEE Trans. Image Process. 30, 7333–7348 (2021)
Article ADS PubMed Google Scholar
Gao, J., He, T., Zhou, X., Ge, S.: Skeleton-based action recognition with focusing-diffusion graph convolutional networks. IEEE Signal Process. Lett. 28, 2058–2062 (2021)
Article ADS Google Scholar
Xia, R., Li, Y., Luo, W.: Laga-net: local-and-global attention network for skeleton based action recognition. IEEE Trans. Multimed. 24, 2648–2661 (2021)
Article Google Scholar
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inform. Process. Syst. 30, 6000–6010 (2017)
Google Scholar
Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3286–3295 (2019)
Zhao, H., Jia, J., Koltun, V.: Exploring self-attention for image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10076–10085 (2020)
Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J.: Stand-alone self-attention in vision models. Adv. Neural Inform. Process. Syst. 32, 68–80 (2019)
Google Scholar
Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16519–16529 (2021)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pp. 3163–3172 (2021)
Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.-C.: Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In: European Conference on Computer Vision, pp. 108–126 (2020). Springer
DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Li, C., Zhong, Q., Xie, D., Pu, S.: Skeleton-based action recognition with convolutional neural networks. In: 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 597–600 (2017). IEEE
Li, C., Zhong, Q., Xie, D., Pu, S.: Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055 (2018)
Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently recurrent neural network (indrnn): Building a longer and deeper rnn. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5457–5466 (2018)
Shi, L., Zhang, Y., Cheng, J., Lu, H.: Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7912–7921 (2019)
Liu, Y., Zhang, H., Xu, D., He, K.: Graph transformer network with temporal kernel attention for skeleton-based action recognition. Knowl.-Based Syst. 240, 108146 (2022)
Article Google Scholar
Kong, J., Bian, Y., Jiang, M.: Mtt: Multi-scale temporal transformer for skeleton-based action recognition. IEEE Signal Process. Lett. 29, 528–532 (2022)
Article ADS Google Scholar
Jiang, Y., Sun, Z., Yu, S., Wang, S., Song, Y.: A graph skeleton transformer network for action recognition. Symmetry 14(8), 1547 (2022)
Article ADS Google Scholar

Download references

Author information

Authors and Affiliations

Department of Information and Management Systems Engineering, Nagaoka University of Technology, 1603-1 Kamitiomioka, Nagaoka, Niigata, 9402188, Japan
Hu Cui & Tessai Hayama

Authors

Hu Cui
View author publications
You can also search for this author in PubMed Google Scholar
Tessai Hayama
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Tessai Hayama.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Communicated by H. Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Cui, H., Hayama, T. STSD: spatial–temporal semantic decomposition transformer for skeleton-based action recognition. Multimedia Systems 30, 43 (2024). https://doi.org/10.1007/s00530-023-01251-2

Download citation

Received: 12 May 2023
Accepted: 21 December 2023
Published: 26 January 2024
DOI: https://doi.org/10.1007/s00530-023-01251-2

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

STSD: spatial–temporal semantic decomposition transformer for skeleton-based action recognition

Abstract

Access this article

Similar content being viewed by others

Visual attention network

Human Action Recognition and Prediction: A Survey

Human action recognition using fusion of multiview and deep features: an application to video surveillance

Availability of data and materials

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

STSD: spatial–temporal semantic decomposition transformer for skeleton-based action recognition

Abstract

Access this article

Similar content being viewed by others

Visual attention network

Human Action Recognition and Prediction: A Survey

Human action recognition using fusion of multiview and deep features: an application to video surveillance

Availability of data and materials

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation