
Sensing to Hear through Memory: Ultrasound Speech Enhancement without Real Ultrasound Signals

Published: 15 May 2024

Abstract

Speech enhancement on mobile devices is a challenging task due to complex environmental noise. Recent work that uses lip-induced ultrasound signals for speech enhancement opens up new possibilities for solving this problem. However, these multi-modal methods cannot be used in many scenarios where ultrasound-based lip sensing is unreliable or completely absent. In this paper, we propose a novel paradigm that exploits prior learned ultrasound knowledge for multi-modal speech enhancement using only the audio input and an additional pre-enrolled speaker embedding. We design a memory network to store the ultrasound memory and learn the interrelationship between the audio and ultrasound modalities. During inference, the memory network recalls ultrasound representations from the audio input to achieve multi-modal speech enhancement without real ultrasound signals. Moreover, we introduce a speaker embedding module that further boosts enhancement performance and prevents the recall from degrading when the noise level is high. We train the proposed framework end-to-end in a multi-task manner and perform extensive evaluations on the collected dataset. The results show that our method achieves performance comparable to audio-ultrasound methods and significantly outperforms audio-only methods.
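The abstract describes a cross-modal memory that recalls ultrasound representations from audio alone and fuses them with a pre-enrolled speaker embedding. The PyTorch sketch below illustrates one plausible form of such a memory-based recall; the module names, dimensions, attention-based addressing, and GRU fusion are illustrative assumptions and not the authors' implementation.

```python
# Minimal sketch of audio-addressed ultrasound memory recall (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class UltrasoundMemory(nn.Module):
    """Recalls ultrasound-like representations from audio features alone."""

    def __init__(self, num_slots=96, audio_dim=256, ultra_dim=128):
        super().__init__()
        # Audio-addressable keys and ultrasound-valued slots, learned during
        # training when paired audio/ultrasound data is available.
        self.keys = nn.Parameter(torch.randn(num_slots, audio_dim))
        self.values = nn.Parameter(torch.randn(num_slots, ultra_dim))

    def forward(self, audio_feat):
        # audio_feat: (batch, time, audio_dim) frame-level audio embeddings.
        attn = torch.einsum("btd,sd->bts", audio_feat, self.keys)
        attn = F.softmax(attn / self.keys.shape[-1] ** 0.5, dim=-1)
        # Recalled ultrasound representation: (batch, time, ultra_dim).
        return torch.einsum("bts,se->bte", attn, self.values)

class Enhancer(nn.Module):
    """Fuses audio, recalled ultrasound, and a speaker embedding (sketch only)."""

    def __init__(self, audio_dim=256, ultra_dim=128, spk_dim=192, hidden=256):
        super().__init__()
        self.memory = UltrasoundMemory(audio_dim=audio_dim, ultra_dim=ultra_dim)
        self.fuse = nn.GRU(audio_dim + ultra_dim + spk_dim, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, audio_dim)  # mask over the audio features

    def forward(self, audio_feat, spk_emb):
        ultra = self.memory(audio_feat)  # no real ultrasound signal needed
        spk = spk_emb.unsqueeze(1).expand(-1, audio_feat.size(1), -1)
        h, _ = self.fuse(torch.cat([audio_feat, ultra, spk], dim=-1))
        return audio_feat * torch.sigmoid(self.mask(h))  # masked (enhanced) features

# Example: two utterances of 100 frames with pre-enrolled speaker embeddings.
if __name__ == "__main__":
    out = Enhancer()(torch.randn(2, 100, 256), torch.randn(2, 192))
    print(out.shape)  # torch.Size([2, 100, 256])
```

The speaker embedding is broadcast over time so the fusion stage can fall back on speaker identity when the recalled ultrasound cues are unreliable, which mirrors the role the abstract assigns to the pre-enrolled embedding under heavy noise.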



Published in: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 8, Issue 2 (May 2024), 1330 pages. EISSN: 2474-9567. DOI: 10.1145/3665317.

      Copyright © 2024 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


