“Allot?” is “A Lot!” Towards Developing More Generalized Speech Recognition System for Accessible Communication
DOI:
https://doi.org/10.1609/aaai.v38i21.30381
Keywords:
Deep Learning, Machine Learning, Automatic Speech Recognition, Audio And Speech Processing, Wav2vec 2.0, Sound, Computation And Language, Data Augmentation, Accented Speech
Abstract
The proliferation of Automatic Speech Recognition (ASR) systems has revolutionized translation and transcription. However, challenges persist in ensuring inclusive communication for non-native English speakers. This study quantifies the gap between accented and native English speech using Wav2Vec 2.0, a state-of-the-art transformer model. Notably, we found that accented speech exhibits significantly higher word error rates of 30-50%, in contrast to native speakers’ 2-8% (Baevski et al. 2020). Our exploration extends to leveraging accessible online datasets to highlight the potential of enhancing speech recognition by fine-tuning the Wav2Vec 2.0 model. Through experimentation and analysis, we highlight the challenges with training models on accented speech. By refining models and addressing data quality issues, our work presents a pipeline for future investigations aimed at developing an integrated system capable of effectively engaging with a broader range of individuals with diverse backgrounds. Accurate recognition of accented speech is a pivotal step toward democratizing AI-driven communication products.
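The word error rates quoted in the abstract follow the usual definition: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below illustrates that definition only; it is not the authors' evaluation code. The function name `word_error_rate` and the example sentences, loosely inspired by the title, are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercase and split on whitespace; real evaluations
    # often apply further text normalization (punctuation, numerals).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: "a lot" misrecognized as "allot" costs one
# deletion plus one substitution against a four-word reference.
print(word_error_rate("that is a lot", "that is allot"))  # 0.5
```

Because the denominator is the reference length, a single merged or split word can move the score sharply on short utterances, which is worth keeping in mind when comparing rates across datasets.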
Published
2024-03-24
How to Cite
Bandodkar, G., Agarwal, S., Sughosh, A. K., Singh, S., & Choi, T. (2024). “Allot?” is “A Lot!” Towards Developing More Generalized Speech Recognition System for Accessible Communication. Proceedings of the AAAI Conference on Artificial Intelligence, 38(21), 23327-23334. https://doi.org/10.1609/aaai.v38i21.30381
Issue
Section
EAAI: Mentored Undergraduate Research Challenge: AI for Accessibility in Comm