Abstract
Automatic Lip-Reading (ALR), also known as Visual Speech Recognition (VSR), is the technological process of extracting and recognizing speech content based solely on visual analysis of the speaker’s lip movements. Lip-reading is not used only by hearing-impaired people: people with normal hearing also rely on visual cues to disambiguate words whenever they are in a noisy environment. Owing to the increasing interest in developing ALR systems, a considerable number of research articles are being published. This article selects, analyses, and summarizes the main papers from 2018 to early 2022, ranging from traditional methods with handcrafted feature extraction algorithms to end-to-end deep learning-based ALR systems, which take full advantage of learned features and of the ever-growing publicly available databases. By providing an overview of the recent state of the art, identifying trends, and presenting a conclusion on what is to be expected in future work, this article offers an efficient way to catch up on the most relevant ALR techniques.
Acknowledgements
This work is funded by FCT/MEC through national funds and, when applicable, co-funded by the FEDER-PT2020 partnership agreement under the project UIDB/00308/2020.
Copyright information
© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Santos, C., Cunha, A., Coelho, P. (2023). A Review on Deep Learning-Based Automatic Lipreading. In: Cunha, A., M. Garcia, N., Marx Gómez, J., Pereira, S. (eds) Wireless Mobile Communication and Healthcare. MobiHealth 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 484. Springer, Cham. https://doi.org/10.1007/978-3-031-32029-3_17
DOI: https://doi.org/10.1007/978-3-031-32029-3_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-32028-6
Online ISBN: 978-3-031-32029-3
eBook Packages: Computer Science, Computer Science (R0)