Multi-modal Stream Fusion for Skeleton-Based Action Recognition

  • Conference paper
Image and Graphics (ICIG 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14357)

Abstract

In recent years, graph convolutional networks (GCNs) have been widely used in skeleton-based action recognition in pursuit of higher accuracy. Traditional approaches typically integrate the different modalities with uniform fusion weights, which leads to inadequate information fusion across modalities and sacrifices flexibility and robustness. In this paper, we explore the potential of adaptively fusing different modalities and present a new fusion algorithm, coined Multi-modal Stream Fusion GCN (MSF-GCN). The proposed algorithm consists of three branches: JS-GCN, BS-GCN, and MS-GCN, corresponding to joint, bone, and motion modeling, respectively. In our design, the motion patterns of the joint and bone modalities are dynamically fused by an MLP layer. After the motion modeling, the static joint and bone branches are combined with the motion branch to perform the final fusion for the category predictions. MSF-GCN emphasizes static and dynamic fusion simultaneously, which greatly increases the degree of interaction between the information of each modality and improves flexibility. The proposed fusion strategy is applicable to different backbones and can boost performance with only a marginal increase in computation. Extensive experiments on the widely used NTU RGB+D dataset demonstrate that our model achieves better or comparable results relative to current solutions, reflecting the merit of our fusion strategy.
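
The abstract describes a three-branch design in which joint- and bone-motion cues are fused dynamically by an MLP, and the static joint and bone branches then join a late fusion for the final category prediction. Below is a minimal PyTorch sketch of that idea; the `StreamEncoder` stub standing in for the GCN branches, the layer sizes, the two-way softmax MLP, and the averaged late fusion are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of MLP-gated multi-modal stream fusion (assumptions noted above).
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """Stand-in for a skeleton GCN branch (JS-GCN / BS-GCN / MS-GCN)."""

    def __init__(self, in_channels: int, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_channels, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels) pooled skeleton features
        return self.net(x)


class MSFFusion(nn.Module):
    """Dynamic (MLP-gated) fusion of motion cues plus static late fusion."""

    def __init__(self, in_channels: int, num_classes: int, feat_dim: int = 256):
        super().__init__()
        self.joint_branch = StreamEncoder(in_channels, feat_dim)
        self.bone_branch = StreamEncoder(in_channels, feat_dim)
        self.motion_branch = StreamEncoder(in_channels, feat_dim)
        # MLP predicting per-sample fusion weights over the two motion cues.
        self.fusion_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, 2),
            nn.Softmax(dim=-1),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, joint, bone, joint_motion, bone_motion):
        f_j = self.joint_branch(joint)          # static joint features
        f_b = self.bone_branch(bone)            # static bone features
        m_j = self.motion_branch(joint_motion)  # joint-motion features
        m_b = self.motion_branch(bone_motion)   # bone-motion features

        # Dynamic fusion: sample-dependent weights over the two motion streams.
        w = self.fusion_mlp(torch.cat([m_j, m_b], dim=-1))  # (batch, 2)
        f_m = w[:, :1] * m_j + w[:, 1:] * m_b

        # Static late fusion: average the three branch logits for the prediction.
        logits = (self.classifier(f_j) + self.classifier(f_b)
                  + self.classifier(f_m)) / 3.0
        return logits


if __name__ == "__main__":
    # e.g. 25 joints x 3D coordinates, 60 NTU RGB+D classes
    model = MSFFusion(in_channels=75, num_classes=60)
    streams = [torch.randn(4, 75) for _ in range(4)]  # dummy joint/bone/motion inputs
    print(model(*streams).shape)                       # torch.Size([4, 60])
```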

This work is supported in part by the National Natural Science Foundation of China (Grant Nos. 62106089 and 62020106012).

Author information

Corresponding author

Correspondence to Xiao-Jun Wu.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pang, R., Li, R., Xu, T., Song, X., Wu, X.-J. (2023). Multi-modal Stream Fusion for Skeleton-Based Action Recognition. In: Lu, H., et al. (eds.) Image and Graphics. ICIG 2023. Lecture Notes in Computer Science, vol. 14357. Springer, Cham. https://doi.org/10.1007/978-3-031-46311-2_19

  • DOI: https://doi.org/10.1007/978-3-031-46311-2_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46310-5

  • Online ISBN: 978-3-031-46311-2

  • eBook Packages: Computer Science, Computer Science (R0)
