ABSTRACT
While Automatic Speech Recognition (ASR) systems excel in controlled environments, challenges arise in robot-specific setups due to unique microphone requirements and added noise sources. In this paper, we create a dataset of conversation openings and brief exchanges in five European languages, and we systematically evaluate current state-of-the-art ASR systems (Vosk, OpenAI Whisper, Google Speech-to-Text and NVIDIA Riva). Besides standard metrics, we also look at two downstream tasks critical for human-robot verbal interaction: intent recognition and entity extraction, using the open-source Rasa chatbot. Overall, we find that open-source solutions such as Vosk perform competitively with closed-source solutions while running on the edge, on a low compute budget (CPU only).
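The "standard metrics" here include Word Error Rate (WER), which the paper computes with the jiwer package listed in the references. As an illustrative sketch (not the authors' code), WER is the word-level edit distance between a reference transcript and the ASR hypothesis, normalized by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution (free if words match)
        prev = cur
    return prev[-1] / len(ref)

# e.g. one substitution over a four-word reference gives WER = 0.25
print(wer("turn on the light", "turn of the light"))
```

In practice `jiwer.wer(reference, hypothesis)` computes the same quantity with additional text-normalization options.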
- Online. https://alphacephei.com/vosk/
- Online. https://cloud.google.com/speech-to-text/?hl=en
- Online. https://github.com/openai/whisper
- Online. https://www.nvidia.com/en-us/ai-data-science/products/riva/
- Online. https://rasa.com/
- Online. https://rasa.com/docs/rasa/glossary/#intent
- Online. https://rasa.com/docs/rasa/glossary/#entity
- Online. https://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0/
- Online. https://pypi.org/project/jiwer
- Sören Becker, Marcel Ackermann, Sebastian Lapuschkin, Klaus-Robert Müller, and Wojciech Samek. 2018. Interpreting and Explaining Deep Neural Networks for Classification of Audio Signals. CoRR abs/1807.03418 (2018). arXiv:1807.03418
- Sara Cooper, Alessandro Di Fava, Carlos Vivas, Luca Marchionni, and Francesco Ferro. 2020. ARI: the Social Assistive Robot and Companion. In 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). 745--751. https://doi.org/10.1109/RO-MAN47096.2020.9223470
- Lingyun Feng, Jianwei Yu, Deng Cai, Songxiang Liu, Haitao Zheng, and Yan Wang. 2022. ASR-GLUE: A New Multi-task Benchmark for ASR-Robust Natural Language Understanding. arXiv:2108.13048 [cs.CL]
- Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv:2005.08100 [eess.AS]
- Zohar Jackson, César Souza, Jason Flaks, Yuxin Pan, Hereman Nicolas, and Adhish Thite. 2018. Jakobovski/free-spoken-digit-dataset: v1.0.8. https://doi.org/10.5281/zenodo.1342401
- S. Lemaignan, S. Cooper, R. Ros, L. Ferrini, A. Andriella, and A. Irisarri. 2023. Open-source Natural Language Processing on the PAL Robotics ARI Social Robot. In Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction. https://doi.org/10.1145/3568294.3580041
- Mishaim Malik, Muhammad Malik, Khawar Mehmood, and Imran Makhdoom. 2021. Automatic speech recognition: a survey. Multimedia Tools and Applications 80 (03 2021), 1--47. https://doi.org/10.1007/s11042-020-10073-7
- Mirko Marras, Pedro A. Marín-Reyes, Javier Lorenzo-Navarro, Modesto Castrillón-Santana, and Gianni Fenu. 2019. AveRobot: An Audio-visual Dataset for People Re-identification and Verification in Human-Robot Interaction. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - ICPRAM. INSTICC, SciTePress, 255--265. https://doi.org/10.5220/0007690902550265
- José Novoa-Ilic, Rodrigo Mahu, Jorge Wuth, Juan Escudero, Josué Fredes, and Nestor Yoma. 2021. Automatic Speech Recognition for Indoor HRI Scenarios. ACM Transactions on Human-Robot Interaction 10 (03 2021), 1--30. https://doi.org/10.1145/3442629
Index Terms
- Dataset and Evaluation of Automatic Speech Recognition for Multi-lingual Intent Recognition on Social Robots