ABSTRACT
While Automatic Speech Recognition (ASR) systems excel in controlled environments, challenges arise in robot-specific setups due to unique microphone requirements and added noise sources. In this paper, we create a dataset of conversation openings and brief exchanges in five European languages, and we systematically evaluate current state-of-the-art ASR systems (Vosk, OpenAI Whisper, Google Speech-to-Text and NVIDIA Riva). Besides standard metrics, we also look at two downstream tasks critical for human-robot verbal interaction: intent recognition and entity extraction, using the open-source Rasa chatbot. Overall, we find that open-source solutions such as Vosk perform competitively with closed-source solutions while running on the edge, on a low compute budget (CPU only).
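The "standard metrics" here include Word Error Rate (WER), which the paper computes with the jiwer package listed in the references. As an illustrative sketch (not the authors' code), WER is the word-level edit distance between a reference transcript and the ASR hypothesis, normalized by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution (free if words match)
        prev = cur
    return prev[-1] / len(ref)

# e.g. one substitution over a four-word reference gives WER = 0.25
print(wer("turn on the light", "turn of the light"))
```

In practice `jiwer.wer(reference, hypothesis)` computes the same quantity with additional text-normalization options.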
- Online. https://alphacephei.com/vosk/
- Online. https://cloud.google.com/speech-to-text/?hl=en
- Online. https://github.com/openai/whisper
- Online. https://www.nvidia.com/en-us/ai-data-science/products/riva/
- Online. https://rasa.com/
- Online. https://rasa.com/docs/rasa/glossary/#intent
- Online. https://rasa.com/docs/rasa/glossary/#entity
- Online. https://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0/
- Online. https://pypi.org/project/jiwer
- Sören Becker, Marcel Ackermann, Sebastian Lapuschkin, Klaus-Robert Müller, and Wojciech Samek. 2018. Interpreting and Explaining Deep Neural Networks for Classification of Audio Signals. CoRR abs/1807.03418 (2018). arXiv:1807.03418
- Sara Cooper, Alessandro Di Fava, Carlos Vivas, Luca Marchionni, and Francesco Ferro. 2020. ARI: the Social Assistive Robot and Companion. In 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). 745--751. https://doi.org/10.1109/RO-MAN47096.2020.9223470
- Lingyun Feng, Jianwei Yu, Deng Cai, Songxiang Liu, Haitao Zheng, and Yan Wang. 2022. ASR-GLUE: A New Multi-task Benchmark for ASR-Robust Natural Language Understanding. arXiv:2108.13048 [cs.CL]
- Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv:2005.08100 [eess.AS]
- Zohar Jackson, César Souza, Jason Flaks, Yuxin Pan, Hereman Nicolas, and Adhish Thite. 2018. Jakobovski/free-spoken-digit-dataset: v1.0.8. https://doi.org/10.5281/zenodo.1342401
- S. Lemaignan, S. Cooper, R. Ros, L. Ferrini, A. Andriella, and A. Irisarri. 2023. Open-source Natural Language Processing on the PAL Robotics ARI Social Robot. In Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction. https://doi.org/10.1145/3568294.3580041
- Mishaim Malik, Muhammad Malik, Khawar Mehmood, and Imran Makhdoom. 2021. Automatic speech recognition: a survey. Multimedia Tools and Applications 80 (03 2021), 1--47. https://doi.org/10.1007/s11042-020-10073-7
- Mirko Marras, Pedro A. Marín-Reyes, Javier Lorenzo-Navarro, Modesto Castrillón-Santana, and Gianni Fenu. 2019. AveRobot: An Audio-visual Dataset for People Re-identification and Verification in Human-Robot Interaction. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - ICPRAM. INSTICC, SciTePress, 255--265. https://doi.org/10.5220/0007690902550265
- José Novoa-Ilic, Rodrigo Mahu, Jorge Wuth, Juan Escudero, Josué Fredes, and Nestor Yoma. 2021. Automatic Speech Recognition for Indoor HRI Scenarios. ACM Transactions on Human-Robot Interaction 10 (03 2021), 1--30. https://doi.org/10.1145/3442629
Index Terms
- Dataset and Evaluation of Automatic Speech Recognition for Multi-lingual Intent Recognition on Social Robots