Abstract
Existing sign language recognition (SLR) models have been shown to lack precision in identifying signs because they cannot adequately discriminate between classes. Specifically, trained SLR models are sensitive to small variations in hand movements and finger shapes across signs in a video sequence. To overcome this problem, this work proposes to learn a class label by computing a metric variable that squeezes the displacement between within-class and across-class labels. In general, metric learning on video data is considerably slower than other deep SLR classification architectures because of the triplet pairing process. In traditional triplet pairing, all frames in all classes participate in training in each episode. In contrast, this paper proposes a self-sourced singular pairing process between the anchor and positive frames, together with an attention mechanism, resulting in the Brisk Paired Deep Metric Attention Learning (BPDMAL) model. BPDMAL, integrated with standard deep learning architectures, is evaluated on our 2D video sign language dataset, KL2DSL, and on two other benchmark video-based sign language datasets. The proposed BPDMAL improves performance over traditional deep metric learning (DML) and state-of-the-art SLR deep learning models, with an incremental reduction in training and inference times, making it a useful model for real-time deployment.
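As a rough illustration of why traditional triplet pairing is costly, the following minimal sketch shows the standard triplet margin loss and counts the triplets produced when every frame of every class participates. It is not the BPDMAL pairing scheme itself. The function names, the margin value, and the assumption of an equal number of frames per class are all hypothetical choices made for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The margin of 1.0 is an illustrative default, not a value from the paper.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def exhaustive_triplet_count(num_classes, frames_per_class):
    """Count (anchor, positive, negative) triplets under exhaustive pairing.

    Assumes every class has the same number of frames (a simplification).
    """
    # Ordered anchor-positive pairs drawn from a single class.
    anchor_positive_pairs = frames_per_class * (frames_per_class - 1)
    # Every frame of every other class can serve as the negative.
    negatives = (num_classes - 1) * frames_per_class
    return num_classes * anchor_positive_pairs * negatives
```

Under these assumptions the triplet count grows roughly with the cube of the number of frames per class, which is the kind of cost a reduced pairing scheme would aim to avoid.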
Availability of Data and Materials
Data will be made available on reasonable request.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest in any form for this research.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kishore, P.V.V., Anil Kumar, D. & Srinivasa Rao, K. Sign Language Recognition (SLR): A Brisk Paired Deep Metric Attention Learning (BPDMAL) Model for Video Data Applications. SN COMPUT. SCI. 5, 419 (2024). https://doi.org/10.1007/s42979-024-02793-6
DOI: https://doi.org/10.1007/s42979-024-02793-6