
A mechanism for personalized Automatic Speech Recognition for less frequently spoken languages: the Greek case

Multimedia Tools and Applications

Abstract

Automatic Speech Recognition (ASR) has become increasingly popular since it significantly simplifies human-computer interaction, providing a more intuitive way of communication. Building an accurate, general-purpose ASR system is a challenging task that requires large amounts of data and computing power. Especially for languages that are not widely spoken, such as Greek, the lack of sufficiently large speech datasets leads to the development of ASR systems adapted to a restricted corpus and/or to specific topics. When used in specific domains, these systems can be both accurate and fast, without the need for large datasets and extended training. An interesting application domain for such narrow-scope ASR systems is the development of personalized models for dictation. In the current work we present three personalization-via-adaptation modules that can be integrated into any ASR/dictation system to increase its accuracy. The adaptation can be applied both to the language model (based on the user's past text samples) and to the acoustic model (using a set of the user's narrations). To provide more precise recommendations, clustering algorithms are applied and topic-specific language models are created. In addition, heterogeneous adaptation methods are combined to provide recommendations to the user. Evaluation performed on a self-created database of 746 text samples drawn from messaging applications and e-mails of a single user demonstrates that the proposed approach achieves better results than the existing general-purpose Greek models.
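The topic-specific language-model step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract only states that clustering is applied to the user's past texts, so the choice of TF-IDF features with k-means, and the helper name `topic_specific_counts`, are assumptions. The sketch yields per-topic unigram counts, the raw material that a toolkit such as SRILM or CMU Sphinx would turn into actual topic-specific language models.

```python
# Illustrative sketch: cluster a user's past texts into topics, then
# collect per-topic word counts as raw material for topic-specific
# language models. TF-IDF + k-means are assumptions for this sketch.
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def topic_specific_counts(texts, n_topics=2, seed=0):
    """Group texts into n_topics clusters and count words per cluster."""
    features = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=n_topics, n_init=10,
                    random_state=seed).fit_predict(features)
    counts = {topic: Counter() for topic in range(n_topics)}
    for text, topic in zip(texts, labels):
        counts[topic].update(text.lower().split())
    return labels, counts


if __name__ == "__main__":
    past_texts = [
        "meeting agenda for the project review",
        "project meeting moved to friday",
        "dinner plans tonight with family",
        "family dinner at eight tonight",
    ]
    labels, counts = topic_specific_counts(past_texts)
    print(labels)
```

A dictation front end would then select (or interpolate) the topic model closest to the current context before decoding.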


Code Availability

The code is available on GitHub (Note 12).

Notes

  1. https://github.com/cmusphinx/pocketsphinx

  2. https://pypi.org/project/num2words/

  3. https://pypi.org/project/alphabet-detector/

  4. https://github.com/cmusphinx/sphinxbase

  5. https://github.com/cmusphinx/sphinxtrain

  6. https://spacy.io/models/el/

  7. https://fasttext.cc/docs/en/crawl-vectors.html

  8. https://github.com/PanosAntoniadis/fast-recorder

  9. https://pypi.org/project/sounddevice/

  10. https://github.com/eellak/gsoc2019-sphinx

  11. https://summerofcode.withgoogle.com/archive/2019/projects/4683496042266624/

  12. https://github.com/PanosAntoniadis/personalized_asr


Funding

Part of this work was supported by Google Summer of Code as an open source project (Note 11).

Author information

Corresponding author

Correspondence to Panagiotis Antoniadis.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Antoniadis, P., Tsardoulias, E. & Symeonidis, A. A mechanism for personalized Automatic Speech Recognition for less frequently spoken languages: the Greek case. Multimed Tools Appl 81, 40635–40652 (2022). https://doi.org/10.1007/s11042-022-12953-6
