Abstract
Speech enhancement on mobile devices is challenging due to complex environmental noise. Recent work that uses lip-induced ultrasound signals for speech enhancement opens up new possibilities for this problem. However, such multi-modal methods cannot be applied in the many scenarios where ultrasound-based lip sensing is unreliable or entirely absent. In this paper, we propose a novel paradigm that exploits previously learned ultrasound knowledge to perform multi-modal speech enhancement using only the audio input and a pre-enrolled speaker embedding. We design a memory network that stores ultrasound representations and learns the interrelationship between the audio and ultrasound modalities. During inference, the memory network recalls ultrasound representations from the audio input alone, achieving multi-modal speech enhancement without real ultrasound signals. Moreover, we introduce a speaker embedding module that further boosts enhancement performance and prevents the recall from degrading at high noise levels. We train the proposed framework end-to-end in a multi-task manner and perform extensive evaluations on our collected dataset. The results show that our method performs comparably to audio-ultrasound methods and significantly outperforms audio-only methods.
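The recall step described above can be viewed as key-value attention over a learned cross-modal memory: audio keys address the memory, and the weighted ultrasound values are returned. The following is a minimal NumPy sketch of that idea; all names, dimensions, and the softmax addressing scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def recall_ultrasound(audio_query, key_memory, value_memory, temperature=1.0):
    """Recall an ultrasound-like representation from an audio query.

    audio_query:  (d,) audio feature vector
    key_memory:   (n_slots, d) learned audio-side keys
    value_memory: (n_slots, d) learned ultrasound-side values
    """
    scores = key_memory @ audio_query / temperature    # similarity to each slot
    scores -= scores.max()                             # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax addressing
    return weights @ value_memory                      # weighted recall, shape (d,)

# Toy usage: a query strongly aligned with slot 3 recalls (approximately) value 3.
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16))
values = rng.normal(size=(8, 16))
query = keys[3] * 5.0
recalled = recall_ultrasound(query, keys, values)
```

In a trained system, both memories would be learned jointly with the enhancement network, so that the recalled vector substitutes for the missing ultrasound stream at inference time.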
Sensing to Hear through Memory: Ultrasound Speech Enhancement without Real Ultrasound Signals