A Review on Deep Learning-Based Automatic Lipreading

Santos, Carlos; Cunha, António; Coelho, Paulo

doi:10.1007/978-3-031-32029-3_17

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 484)

Included in the following conference series:

International Conference on Wireless Mobile Communication and Healthcare

Abstract

Automatic Lip-Reading (ALR), also known as Visual Speech Recognition (VSR), is the technological process of extracting and recognizing speech content based solely on the visual recognition of the speaker’s lip movements. Besides hearing-impaired people, people with normal hearing also resort to visual cues for word disambiguation whenever they are in a noisy environment. Due to the increasing interest in developing ALR systems, a considerable number of research articles are being published. This article selects, analyses, and summarizes the main papers from 2018 to early 2022, ranging from traditional methods with handcrafted feature extraction algorithms to end-to-end deep learning-based ALR, which takes full advantage of learning the best features and of the ever-growing publicly available databases. By providing a recent state-of-the-art overview, identifying trends, and presenting a conclusion on what is to be expected in future work, this article offers an efficient way to get up to date on the most relevant ALR techniques.

Acknowledgements

This work is funded by FCT/MEC through national funds and, when applicable, co-funded by the FEDER-PT2020 partnership agreement under the project UIDB/00308/2020.

Author information

Authors and Affiliations

School of Technology and Management, Polytechnic of Leiria, 2411-901, Leiria, Portugal
Carlos Santos & Paulo Coelho
Escola de Ciências e Tecnologias, University of Trás-os-Montes e Alto Douro, Quinta de Prados, 5001-801, Vila Real, Portugal
António Cunha
Institute for Systems and Computer Engineering, Technology and Science (INESC TEC), 4200-465, Porto, Portugal
António Cunha
Institute for Systems Engineering and Computers at Coimbra (INESC Coimbra), DEEC, Pólo II, 3030-290, Coimbra, Portugal
Paulo Coelho

Authors

Carlos Santos
António Cunha
Paulo Coelho

Corresponding author

Correspondence to Paulo Coelho.

Editor information

Editors and Affiliations

University of Trás-os-Montes and Alto Douro, Vila Real, Portugal
António Cunha
University of Beira Interior, Covilha, Portugal
Nuno M. Garcia
Ossietzky Universität Oldenburg, Oldenburg, Niedersachsen, Germany
Jorge Marx Gómez
University of Trás-os-Montes and Alto Douro, Vila Real, Portugal
Sandra Pereira

About this paper

Cite this paper

Santos, C., Cunha, A., Coelho, P. (2023). A Review on Deep Learning-Based Automatic Lipreading. In: Cunha, A., Garcia, N.M., Marx Gómez, J., Pereira, S. (eds) Wireless Mobile Communication and Healthcare. MobiHealth 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 484. Springer, Cham. https://doi.org/10.1007/978-3-031-32029-3_17

DOI: https://doi.org/10.1007/978-3-031-32029-3_17
Published: 14 May 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-32028-6
Online ISBN: 978-3-031-32029-3
eBook Packages: Computer Science, Computer Science (R0)
