Abstract
Silent speech recognition (SSR) allows users to speak to devices without making a sound, avoiding being overheard or disturbing others. Compared with video-based approaches, wireless signal-based SSR works even when the user is wearing a mask and raises fewer privacy concerns. However, previous wireless-based systems remain far from well studied: they have been evaluated only on corpora of highly limited size, making them feasible only for interaction with dozens of deterministic commands. In this paper, we present mSilent, a millimeter-wave (mmWave) based SSR system that works on a general corpus containing thousands of daily conversation sentences. With this stronger recognition capability, mSilent not only supports more complex interaction with assistants, but also enables more general applications in daily life such as communication and text input. To extract fine-grained articulatory features, we build a signal processing pipeline that uses a clustering-selection algorithm to separate articulatory gestures and generates a multi-scale detrended spectrogram (MSDS). To handle the complexity of the general corpus, we design an end-to-end deep neural network that consists of a multi-branch convolutional front-end and a Transformer-based sequence-to-sequence back-end. We collect a general-corpus dataset of 1,000 daily conversation sentences containing 21K samples of bi-modality data (mmWave and video). Our evaluation shows that mSilent achieves a 9.5% average word error rate (WER) at a distance of 1.5 m, comparable to the performance of the state-of-the-art video-based approach. We also explore deploying mSilent in two typical scenarios, text entry and in-car assistant, where an average WER below 6% demonstrates mSilent's potential in general daily applications.
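As a rough illustration of the evaluation metric used above, word error rate is the word-level Levenshtein (edit) distance between the recognized sentence and the reference, normalized by the reference length. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name and command strings are illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, recognizing "turn off the light" against the reference "turn on the light" is one substitution out of four words, a WER of 0.25; a 9.5% average WER thus means roughly one word error per ten reference words.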
mSilent: Towards General Corpus Silent Speech Recognition Using COTS mmWave Radar