Lip Reading Using Temporal Adaptive Module

Huang, Jian; Teng, Lianwei; Xiao, Yewei; Zhu, Aosu; Liu, Xuanming

doi:10.1007/978-981-99-8141-0_26

Jian Huang (ORCID: orcid.org/0009-0006-1898-3756),
Lianwei Teng (ORCID: orcid.org/0000-0001-6523-9731),
Yewei Xiao (ORCID: orcid.org/0000-0001-9689-3760),
Aosu Zhu &
Xuanming Liu

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1964)

Included in the following conference series:

International Conference on Neural Information Processing


Abstract

Lip reading is a fine-grained video understanding task that aims to recognize speech content by analyzing the movement of the speaker’s mouth. Recently, 3D-ResNet-18 has become the favored front-end network for most lip reading methods. However, a single 3D CNN layer within the 3D-ResNet-18-based front-end network may not have enough representational power to extract temporal features. To address this issue, we propose incorporating the Temporal Adaptive Module (TAM) into the front-end network of lip reading methods. TAM is a simple temporal module that consists of two branches: a local branch that provides location-sensitive information, and a global branch that captures long-term temporal dependencies. Combining the two branches helps capture complex temporal structures and supports robust temporal modeling, and explicitly taking both global and local relationships into consideration improves the feature representation. TAM can also be easily inserted into the classical building blocks of existing networks. We conducted ablation studies to determine the optimal TAM structure and compared our results with various related approaches on the LRW dataset. The experimental results demonstrate the superiority of our approach.
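As a rough illustration of the two-branch idea, the following minimal Python sketch applies a local branch and a global branch to the temporal sequence of a single channel. The abstract does not specify TAM's internals, so everything here is an assumption made for illustration: the layer layout (conv, ReLU, conv, sigmoid for the local branch; two fully connected layers with softmax for the global branch), the kernel size of 3, the hidden width, the random weights, and all function names. Spatial dimensions and batch normalization are omitted; the input is assumed to be a per-frame descriptor that has already been spatially pooled.

```python
import math
import random


def conv1d_same(seq, kernel):
    """Zero-padded 1D cross-correlation that preserves the sequence length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(k * padded[t + i] for i, k in enumerate(kernel))
            for t in range(len(seq))]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(vals):
    m = max(vals)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in vals]
    total = sum(exps)
    return [e / total for e in exps]


def local_branch(seq, w1, w2):
    """Location-sensitive weights, one per frame, each in (0, 1)."""
    hidden = [max(0.0, h) for h in conv1d_same(seq, w1)]  # conv + ReLU
    return [sigmoid(h) for h in conv1d_same(hidden, w2)]  # conv + sigmoid


def global_branch(seq, fc1, fc2):
    """Temporal kernel computed from the whole sequence (long-range context)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, seq))) for row in fc1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in fc2]
    return softmax(logits)  # kernel entries are positive and sum to 1


def tam_channel(seq, params):
    """Reweight frames with the local branch, then aggregate them over time
    with the kernel produced by the global branch."""
    importance = local_branch(seq, params["w1"], params["w2"])
    kernel = global_branch(seq, params["fc1"], params["fc2"])
    excited = [a * x for a, x in zip(importance, seq)]
    return conv1d_same(excited, kernel), importance, kernel


def random_params(T, K=3, hidden=8, seed=0):
    """Hypothetical random weights; a trained model would learn these."""
    rng = random.Random(seed)

    def rand(n):
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

    return {
        "w1": rand(3),                             # local conv, layer 1
        "w2": rand(3),                             # local conv, layer 2
        "fc1": [rand(T) for _ in range(hidden)],   # global FC, T -> hidden
        "fc2": [rand(hidden) for _ in range(K)],   # global FC, hidden -> K
    }


# Pooled descriptor of one channel over five frames (made-up values).
frames = [0.1, 0.5, 0.9, 0.4, 0.2]
params = random_params(T=len(frames))
out, importance, kernel = tam_channel(frames, params)
print(len(out), [round(a, 3) for a in importance], [round(k, 3) for k in kernel])
```

In this sketch the local branch sees only a short temporal window around each frame, while the global branch sees all frames at once before emitting its kernel, which is one way to read the distinction between location-sensitive information and long-term dependencies drawn in the abstract.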



Author information

Authors and Affiliations

Institute of Automation and Electronic Information, Xiangtan University, Xiangtan, China
Jian Huang, Yewei Xiao, Aosu Zhu & Xuanming Liu
College of Intelligent Science, National University of Defense Technology, Changsha, China
Lianwei Teng


Corresponding author

Correspondence to Yewei Xiao.

Editor information

Editors and Affiliations

School of Automation, Central South University, Changsha, China
Biao Luo
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Long Cheng
Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, China
Zheng-Guang Wu
School of Automation, Guangdong University of Technology, Guangzhou, China
Hongyi Li
School of Electrical Engineering and Telecommunications, UNSW Sydney, Sydney, NSW, Australia
Chaojie Li


About this paper

Cite this paper

Huang, J., Teng, L., Xiao, Y., Zhu, A., Liu, X. (2024). Lip Reading Using Temporal Adaptive Module. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1964. Springer, Singapore. https://doi.org/10.1007/978-981-99-8141-0_26


DOI: https://doi.org/10.1007/978-981-99-8141-0_26
Published: 26 November 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8140-3
Online ISBN: 978-981-99-8141-0
eBook Packages: Computer Science, Computer Science (R0)
