
mSilent: Towards General Corpus Silent Speech Recognition Using COTS mmWave Radar

Published: 28 March 2023

Abstract

Silent speech recognition (SSR) allows users to speak to a device without making a sound, avoiding being overheard or disturbing others. Compared to video-based approaches, wireless signal-based SSR works even when the user is wearing a mask and raises fewer privacy concerns. However, previous wireless-based systems remain far from well studied, e.g., they have only been evaluated on corpora of highly limited size, making them feasible only for interaction with dozens of deterministic commands. In this paper, we present mSilent, a millimeter-wave (mmWave) based SSR system that works on a general corpus containing thousands of daily conversation sentences. With this strong recognition capability, mSilent not only supports more complex interaction with assistants, but also enables more general daily applications such as communication and text input. To extract fine-grained articulatory features, we build a signal processing pipeline that uses a clustering-selection algorithm to separate articulatory gestures and generates a multi-scale detrended spectrogram (MSDS). To handle the complexity of the general corpus, we design an end-to-end deep neural network consisting of a multi-branch convolutional front-end and a Transformer-based sequence-to-sequence back-end. We collect a general corpus dataset of 1,000 daily conversation sentences, containing 21K samples of bi-modality data (mmWave and video). Our evaluation shows that mSilent achieves a 9.5% average word error rate (WER) at a distance of 1.5 m, comparable to the performance of the state-of-the-art video-based approach. We also explore deploying mSilent in two typical scenarios, text entry and in-car assistance, where the below-6% average WER demonstrates mSilent's potential in general daily applications.
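The word error rate (WER) reported in the abstract is the standard word-level edit-distance metric used throughout speech recognition: the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. As an illustration (not code from the paper), a minimal Python implementation might look like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("turn on the light", "turn off light")` is 0.5: one substitution ("on" → "off") plus one deletion ("the") over a four-word reference.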



• Published in

  Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 7, Issue 1
  March 2023
  1243 pages
  EISSN: 2474-9567
  DOI: 10.1145/3589760

      Copyright © 2023 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article
      • Research
      • Refereed
