Abstract
Most existing deep-learning methods for dynamic sign language recognition directly use the raw video sequence, or the whole sequence based on RGB information, rather than only the frames that represent the change of gesture. This makes it difficult for sign language recognition to achieve good accuracy. To address this problem, this paper proposes a sign language recognition method based on skeleton data and an SK3D-Residual network. Within the SK3D-Residual network, a key frame optimization algorithm for skeleton sequences is designed based on mutual information. The 3D-Residual module then extracts spatiotemporal features from the skeleton key frame sequences, analyzes the features of each action in the sequence, and recognizes the sign. The method achieves an experimental accuracy of 88.6%, and combining RGB and skeleton information raises the accuracy to 93.2%.
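As an illustration of the idea only, mutual-information-based key frame selection on a skeleton sequence could be sketched as follows. The abstract does not specify the estimator, the threshold, or the selection rule, so everything here is an assumption: the histogram estimator, the function names `mutual_information` and `select_key_frames`, the parameters `bins` and `threshold`, and the rule of keeping a frame when its mutual information with the last kept frame drops below the threshold. It is not the algorithm designed in the paper.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Histogram estimate of the mutual information (in nats) of two 1-D arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


def select_key_frames(skeleton, threshold, bins=8):
    """Return indices of key frames in a skeleton sequence of shape (T, J, C).

    Hypothetical rule: always keep frame 0, then keep a frame whenever its
    mutual information with the most recently kept frame is below `threshold`,
    i.e. when the pose has changed enough to carry new information.
    """
    flat = skeleton.reshape(len(skeleton), -1)  # one coordinate vector per frame
    keep = [0]
    for t in range(1, len(flat)):
        if mutual_information(flat[keep[-1]], flat[t], bins) < threshold:
            keep.append(t)
    return keep
```

In this sketch, frames that closely repeat the last kept pose share high mutual information with it and are dropped, so the retained indices mark the points where the gesture changes. A suitable `threshold` would have to be tuned on the data.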
Data availability
No new datasets were generated for this paper. The experiments use existing datasets that are already available.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 62076117, 61762061 and 62166026) and the Jiangxi Key Laboratory of Smart City (Grant No. 20192BCD40002).
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Han, Q., Huangfu, Z., Min, W. et al. Sign language recognition based on skeleton and SK3D-Residual network. Multimed Tools Appl 83, 18059–18072 (2024). https://doi.org/10.1007/s11042-023-16117-y
DOI: https://doi.org/10.1007/s11042-023-16117-y