Abstract
Lip reading is a fine-grained video understanding task that aims to recognize speech content by analyzing the movement of the speaker’s mouth. Recently, 3D-ResNet-18 has become the preferred front-end network for most lip reading methods. However, a single 3D CNN layer within the 3D-ResNet-18-based front-end network may lack the representational power needed to extract temporal features. To address this issue, we propose incorporating the Temporal Adaptive Module (TAM) into the front-end network of lip reading methods. TAM is a simple temporal module with two branches: a local branch that provides location-sensitive information, and a global branch that captures long-term temporal dependencies. Combining the two branches helps capture complex temporal structures and supports robust temporal modeling, and explicitly taking both global and local relationships into account improves the feature representation. TAM can also be inserted easily into the classical building blocks of existing networks. We conducted ablation studies to determine the best-performing TAM structure and compared our results with various related approaches on the LRW dataset. The experimental results demonstrate the superiority of our approach.
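The two-branch design described above can be illustrated with a minimal NumPy sketch. It loosely follows the general idea of a temporal adaptive module: a local branch that gates each channel at each frame, and a global branch that produces a per-channel temporal kernel from the whole sequence. The helper names (`conv1d_same`, `init_params`, `tam_sketch`), the weight shapes, the reduction ratio, and the kernel size are assumptions made for illustration and are not taken from this chapter; the weights are random rather than trained, and normalization layers are omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def conv1d_same(x, w):
    """Temporal convolution with zero padding.

    x: (C_in, T), w: (C_out, C_in, K) with odd K. Returns (C_out, T).
    """
    c_out, _, k = w.shape
    pad = k // 2
    t = x.shape[1]
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, t))
    for i in range(k):
        out += w[:, :, i] @ xp[:, i:i + t]
    return out


def init_params(channels, frames, reduction=4, kernel_size=3, rng=None):
    """Random weights for the sketch (hypothetical shapes, not trained)."""
    rng = np.random.default_rng(0) if rng is None else rng
    hidden = max(channels // reduction, 1)
    return {
        "local_w1": 0.1 * rng.standard_normal((hidden, channels, kernel_size)),
        "local_w2": 0.1 * rng.standard_normal((channels, hidden, 1)),
        "global_w1": 0.1 * rng.standard_normal((frames, 2 * frames)),
        "global_w2": 0.1 * rng.standard_normal((2 * frames, kernel_size)),
    }


def tam_sketch(x, params):
    """Apply the two-branch temporal module to x of shape (C, T, H, W)."""
    t = x.shape[1]
    # Squeeze the spatial dimensions so both branches see a (C, T) summary.
    pooled = x.mean(axis=(2, 3))

    # Local branch: short temporal convolutions yield a gate in (0, 1)
    # for every channel at every frame (location-sensitive weighting).
    hidden = np.maximum(conv1d_same(pooled, params["local_w1"]), 0.0)
    gate = sigmoid(conv1d_same(hidden, params["local_w2"]))
    z = x * gate[:, :, None, None]

    # Global branch: fully connected layers over the whole time axis
    # produce one normalized temporal kernel per channel.
    g = np.maximum(pooled @ params["global_w1"], 0.0)
    kernel = softmax(g @ params["global_w2"], axis=-1)  # (C, K)

    # Aggregate neighbouring frames with the per-channel kernel.
    k = kernel.shape[1]
    pad = k // 2
    zp = np.pad(z, ((0, 0), (pad, pad), (0, 0), (0, 0)))
    y = np.zeros_like(z)
    for i in range(k):
        y += kernel[:, i, None, None, None] * zp[:, i:i + t]
    return y


# Example: 8 channels, 5 frames, 4x4 spatial feature map.
rng = np.random.default_rng(0)
x = np.abs(rng.standard_normal((8, 5, 4, 4)))
params = init_params(channels=8, frames=5, rng=rng)
y = tam_sketch(x, params)
print(y.shape)
```

The output keeps the input shape, which is what would allow such a module to be placed inside an existing residual block without changing the surrounding layers.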
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Huang, J., Teng, L., Xiao, Y., Zhu, A., Liu, X. (2024). Lip Reading Using Temporal Adaptive Module. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1964. Springer, Singapore. https://doi.org/10.1007/978-981-99-8141-0_26
DOI: https://doi.org/10.1007/978-981-99-8141-0_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8140-3
Online ISBN: 978-981-99-8141-0
eBook Packages: Computer Science, Computer Science (R0)