Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

Authors

  • Ji-Hoon Kim Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
  • Jaehun Kim Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
  • Joon Son Chung Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea

DOI:

https://doi.org/10.1609/aaai.v38i3.28055

Keywords:

CV: Applications, NLP: Speech, CV: Multi-modal Vision, ML: Multimodal Learning, ML: Deep Generative Models & Autoencoders

Abstract

The goal of this work is to reconstruct high quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, which results in mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves the generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to better solve the aforementioned problem, we employ a flow-based post-net which captures and refines the details of the generated speech. We perform extensive experiments on two datasets, and demonstrate that our method achieves generation quality close to that of real human utterances, outperforming existing methods in terms of speech naturalness and intelligibility by a large margin. Synthesised samples are available at our demo page: https://mm.kaist.ac.kr/projects/LTBS.
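The flow-based post-net mentioned in the abstract builds on normalizing flows, whose basic unit is an invertible affine coupling layer: half of the input passes through unchanged, while the other half is scaled and shifted by functions of the first half, so the mapping can be inverted exactly. The sketch below is a minimal illustration of that coupling mechanism, not the authors' implementation; the scalar inputs and the toy scale/shift functions `s` and `t` are assumptions for demonstration only.

```python
import math

# Toy scale and shift "networks" (assumptions; in practice these are
# learned neural networks conditioned on part of the input).
def s(x1):
    return math.tanh(0.5 * x1)

def t(x1):
    return 0.3 * x1

def coupling_forward(x1, x2):
    """One affine coupling step: x1 passes through; x2 is scaled and
    shifted by functions of x1, keeping the transform invertible."""
    return x1, x2 * math.exp(s(x1)) + t(x1)

def coupling_inverse(y1, y2):
    """Exact inverse of the step above, which is what makes
    flow-based refinement tractable."""
    return y1, (y2 - t(y1)) * math.exp(-s(y1))

# Round-trip check: inverting the forward pass recovers the input.
x1, x2 = 0.7, -1.2
y1, y2 = coupling_forward(x1, x2)
rx1, rx2 = coupling_inverse(y1, y2)
assert abs(rx1 - x1) < 1e-9 and abs(rx2 - x2) < 1e-9
```

Stacking many such layers (with the roles of the two halves alternating) yields an expressive yet exactly invertible transform, which is why flows are well suited to sharpening over-smoothed generated spectrograms.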

Published

2024-03-24

How to Cite

Kim, J.-H., Kim, J., & Chung, J. S. (2024). Let There Be Sound: Reconstructing High Quality Speech from Silent Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2759-2767. https://doi.org/10.1609/aaai.v38i3.28055

Section

AAAI Technical Track on Computer Vision II