Abstract
This work presents a scalable solution to speaker-dependent visual command recognition in the vehicle cabin. The goal is to recognize a limited number of the most frequent driver requests based on lip movements. Unlike previous work, which has focused on automated lip-reading in controlled laboratory environments, we tackle this problem in real driving conditions using the recorded RUSAVIC dataset. By limiting the scope of the task to speaker-dependent recognition with a vocabulary of 50 phrases, the models we train surpass the performance of previous work and can be used in real-life speech recognition applications. To achieve this, we constructed an end-to-end methodology that requires only 10 repetitions of each phrase to reach a recognition accuracy of up to 54% based purely on video information. Our key contributions are: (1) we introduce a novel approach to visual speech data preprocessing and labeling, designed to handle real-life driver data recorded in the vehicle cabin; (2) we investigate to what extent lip-reading is complementary to visual command recognition, depending on the set of recognizable commands; (3) we adapt for our task, train, and compare three state-of-the-art CNN architectures, namely MobileNetV2, DenseNet121, and NASNetMobile, to evaluate the performance of the developed system. The proposed system achieved a word recognition rate (WRR) of 55% for the vehicle-parked-at-a-crossroad scenario and 54% for driving scenarios.
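The reported metric can be illustrated with a minimal sketch, assuming that word recognition rate (WRR) is the fraction of command utterances whose predicted label matches the reference label (the function name and the example data below are illustrative, not taken from the paper):

```python
def word_recognition_rate(predicted, reference):
    """Fraction of commands whose predicted label matches the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("prediction and reference lists must be the same length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Illustrative example: 11 of 20 hypothetical commands recognized correctly.
hypotheses = ["stop"] * 11 + ["start"] * 9
truth = ["stop"] * 20
print(round(word_recognition_rate(hypotheses, truth), 2))  # → 0.55
```

Under this reading, the 55% WRR for the parked scenario corresponds to roughly 11 of every 20 spoken commands being recognized correctly from video alone.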
Acknowledgments
This research is financially supported by the Russian Foundation for Basic Research (project No. 19-29-09081 мк).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Ivanko, D., Ryumin, D., Axyonov, A., Kashevnik, A. (2021). Speaker-Dependent Visual Command Recognition in Vehicle Cabin: Methodology and Evaluation. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science(), vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_27
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3