I Am an Earphone and I Can Hear My User’s Face: Facial Landmark Tracking Using Smart Earphones

Published: 16 December 2023

Abstract

This article presents EARFace, a system that demonstrates the feasibility of tracking facial landmarks for 3D facial reconstruction using in-ear acoustic sensors embedded within smart earphones. This enables a number of applications in facial expression tracking, user interfaces, AR/VR, affective computing, and accessibility, among others. Whereas conventional vision-based solutions break down under poor lighting and occlusions and also raise privacy concerns, earphone platforms are robust to ambient conditions while being privacy-preserving. In contrast to prior work on earable platforms that performs outer-ear sensing for facial motion tracking, EARFace shows the feasibility of completely in-ear sensing with a natural earphone form factor, thus enhancing wearing comfort. The core intuition exploited by EARFace is that the shape of the ear canal changes due to the movement of facial muscles during facial motion. EARFace tracks these changes in ear canal shape by measuring the ultrasonic channel frequency response of the inner ear, which ultimately enables tracking of the facial motion. A transformer-based machine learning model is designed to exploit spectral and temporal relationships in the ultrasonic channel frequency response data to predict the facial landmarks of the user with an accuracy of 1.83 mm. Using these predicted landmarks, a 3D graphical model of the face that replicates the precise facial motion of the user is then reconstructed. Domain adaptation is further performed by adapting the weights of layers with group-wise, differential learning rates, which decreases the training overhead of EARFace. The transformer-based machine learning model runs on smartphones with a processing latency of 13 ms and a low overall power consumption profile. Finally, usability studies indicate higher levels of comfort when wearing EARFace's earphone platform in comparison with alternative form factors.
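To make the sensing step concrete, the sketch below shows one way an ultrasonic channel frequency response could be estimated: play an inaudible sweep into the ear canal and divide the spectrum of the in-ear microphone recording by the spectrum of the transmitted probe over the sweep band. This is a minimal illustration, not EARFace's implementation; the sample rate, the 18-21 kHz band, the sweep duration, and the plain spectral-division deconvolution are all assumed parameters.

```python
# Minimal sketch (assumed parameters, not EARFace's code): estimate the in-ear
# channel frequency response from a near-ultrasonic sweep and its recording.
import numpy as np
from scipy.signal import chirp

FS = 48_000                      # assumed sample rate (Hz)
DUR = 0.05                       # assumed sweep duration (s)
F0, F1 = 18_000, 21_000          # assumed near-ultrasonic probe band (Hz)

t = np.arange(int(FS * DUR)) / FS
probe = chirp(t, f0=F0, f1=F1, t1=DUR, method="linear")   # transmitted sweep x(t)

def channel_frequency_response(recorded: np.ndarray) -> np.ndarray:
    """Estimate |H(f)| = |Y(f) / X(f)| over the probe band from the in-ear mic signal y(t)."""
    n = len(probe)
    X = np.fft.rfft(probe, n)
    Y = np.fft.rfft(recorded[:n], n)
    freqs = np.fft.rfftfreq(n, d=1 / FS)
    band = (freqs >= F0) & (freqs <= F1)
    return np.abs(Y[band] / (X[band] + 1e-12))             # epsilon avoids divide-by-zero bins

# Toy usage: a synthetic "recording" that merely attenuates and delays the probe.
recorded = 0.3 * np.roll(probe, 5)
print(channel_frequency_response(recorded)[:5])
```

Tracking how this band-limited response vector evolves from frame to frame is what would supply the spectral and temporal features that the abstract's transformer model consumes.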

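Similarly, the group-wise, differential learning rates mentioned for domain adaptation can be pictured as assigning progressively larger learning rates to later layer groups when fine-tuning a pretrained landmark model on a new user's data. The sketch below uses PyTorch parameter groups purely for illustration; the layer grouping, the specific rates, and the tiny stand-in model are assumptions and do not reflect EARFace's actual transformer architecture or training code.

```python
# Minimal sketch (assumed, not the authors' code): group-wise differential
# learning rates for adapting a pretrained landmark regressor to a new user.
import torch
import torch.nn as nn

# Stand-in for a pretrained model: "early" and "middle" feature layers plus a
# landmark-regression head producing 68 x 3 coordinates (all sizes assumed).
model = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),   # early layers: mostly generic features
    nn.Linear(256, 256), nn.ReLU(),   # middle layers
    nn.Linear(256, 68 * 3),           # user-specific regression head
)
early, middle, head = model[0], model[2], model[4]

# Group-wise, differential learning rates: earlier groups change slowly,
# the head adapts fastest to the new user, reducing retraining overhead.
optimizer = torch.optim.Adam([
    {"params": early.parameters(),  "lr": 1e-5},
    {"params": middle.parameters(), "lr": 1e-4},
    {"params": head.parameters(),   "lr": 1e-3},
])

def adaptation_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One fine-tuning step on a small batch of the new user's labeled data."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random stand-in data: (batch, features) -> (batch, 68 * 3).
print(adaptation_step(torch.randn(8, 256), torch.randn(8, 68 * 3)))
```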

Published in

ACM Transactions on Internet of Things, Volume 5, Issue 1 (February 2024), 181 pages
EISSN: 2577-6207
DOI: 10.1145/3613526
Editor: Gian Pietro Picco

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 16 December 2023
        • Online AM: 9 August 2023
        • Accepted: 24 July 2023
        • Revised: 8 April 2023
        • Received: 16 October 2022