A Review on Deep Learning-Based Automatic Lipreading

  • Conference paper in Wireless Mobile Communication and Healthcare (MobiHealth 2022)

Abstract

Automatic Lip-Reading (ALR), also known as Visual Speech Recognition (VSR), is the technological process of extracting and recognizing speech content based solely on the visual analysis of the speaker's lip movements. Besides hearing-impaired people, people with regular hearing also resort to visual cues for word disambiguation whenever they find themselves in a noisy environment. Due to the increasing interest in developing ALR systems, a considerable number of research articles are being published. This article selects, analyses, and summarizes the main papers from 2018 to early 2022, ranging from traditional methods built on handcrafted feature-extraction algorithms to end-to-end deep learning-based ALR systems, which take full advantage of learned features and of the ever-growing number of publicly available databases. By providing a recent state-of-the-art overview, identifying trends, and presenting conclusions on what to expect from future work, this article offers an efficient way to stay up to date on the most relevant ALR techniques.
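
Since the review contrasts handcrafted feature pipelines with end-to-end models that learn visual features directly from mouth-region frames, a minimal sketch may help fix ideas. The following PyTorch outline is an illustration for this summary only, not the method of any specific paper under review: in the spirit of LipNet-style end-to-end architectures, a 3D-CNN frontend learns spatiotemporal lip features, a bidirectional GRU backend models temporal context, and a linear head emits per-frame logits suitable for CTC training. All names, layer sizes, and input dimensions are assumed for illustration.

    # Illustrative sketch (assumed architecture, not taken from the paper):
    # end-to-end ALR = 3D-CNN frontend + bidirectional GRU backend + CTC head.
    import torch
    import torch.nn as nn

    class EndToEndALR(nn.Module):
        def __init__(self, vocab_size: int, hidden: int = 256):
            super().__init__()
            # Frontend: 3D convolutions over (time, height, width) learn
            # spatiotemporal lip-motion features instead of handcrafted ones.
            self.frontend = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),
                nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),
            )
            # Backend: bidirectional GRU captures temporal dependencies.
            self.gru = nn.GRU(64, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(2 * hidden, vocab_size)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, 1, frames, height, width) grayscale mouth crops
            f = self.frontend(x)                    # (B, C, T, H', W')
            f = f.mean(dim=(3, 4)).transpose(1, 2)  # pool space -> (B, T, C)
            h, _ = self.gru(f)                      # (B, T, 2*hidden)
            return self.classifier(h)               # per-frame logits for CTC

    model = EndToEndALR(vocab_size=28)      # e.g. 26 letters + space + blank
    clip = torch.randn(2, 1, 75, 64, 128)   # 2 clips of 75 frames, 64x128
    print(model(clip).shape)                # torch.Size([2, 75, 28])

A deployed system would add a lip region-of-interest detection stage before the frontend and a CTC or attention-based decoder after the classifier; the point of the end-to-end formulation is that feature extraction and sequence modelling are trained jointly rather than engineered separately.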



Acknowledgements

This work is funded by FCT/MEC through national funds and, when applicable, co-funded by the FEDER-PT2020 partnership agreement under the project UIDB/00308/2020.

Author information


Correspondence to Paulo Coelho.


Copyright information

© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper


Cite this paper

Santos, C., Cunha, A., Coelho, P. (2023). A Review on Deep Learning-Based Automatic Lipreading. In: Cunha, A., Garcia, N.M., Marx Gómez, J., Pereira, S. (eds.) Wireless Mobile Communication and Healthcare. MobiHealth 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 484. Springer, Cham. https://doi.org/10.1007/978-3-031-32029-3_17


  • DOI: https://doi.org/10.1007/978-3-031-32029-3_17


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-32028-6

  • Online ISBN: 978-3-031-32029-3

  • eBook Packages: Computer Science (R0)
