Automatic Generation of Subtitles for Videos of the Government of La Rioja

San Martín, Mirari; Heras, Jónathan; Mata, Gadea

doi:10.1007/978-3-031-34020-8_30

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1824)

Included in the following conference series:

International Conference on Optimization and Learning

Abstract

Public institutions now routinely publish videos containing important information on their webpages. However, people with hearing impairments have difficulty accessing content provided by that means, and manually transcribing those videos is time-consuming. This problem can be addressed with Automatic Speech Recognition (ASR) systems. In this work, we evaluate the performance of several ASR systems when applied to videos from the Government of La Rioja, Spain. Our study shows that the Whisper medium model provides the best trade-off between accuracy and speed. Using this model, we have generated transcriptions of all the videos on the YouTube channel of the Government of La Rioja. In addition, we have created a tool that facilitates this task for other Spanish YouTube channels. This work can therefore be seen as a step towards improving the accessibility of the information and content produced by Spanish public administrations.
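The abstract does not state which accuracy measure was used to compare the ASR systems. A common choice for this kind of evaluation is the word error rate (WER), the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch under that assumption; the function names `edit_distance` and `word_error_rate` and the example sentences are illustrative and are not taken from the paper.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    # prev[j] is the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    # Lowercasing and whitespace splitting are simplifying assumptions;
    # a real evaluation would usually also normalize punctuation.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    reference = "el gobierno de la rioja presenta el plan"
    hypothesis = "el gobierno de rioja presenta un plan"
    # One deleted word ("la") and one substitution ("el" -> "un") over 8 words.
    print(word_error_rate(reference, hypothesis))  # 0.25
```

A lower WER means a more accurate transcript. Comparing WER against transcription time per video is one way to express an accuracy and speed trade-off like the one reported above.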

This work was partially supported by Ministerio de Ciencia e Innovación [PID2020-115225RB-I00 / AEI / 10.13039/501100011033], by OTRI OTCA221110, and by a regional project of the Government of La Rioja.

Notes

1. https://www.youtube.com/@GobiernoDeLaRiojaES.
2. On July 28th, 2022.
3. The transcriptions of the videos are available at https://github.com/mirenmirari/subtitulos_canalgobierno.
4. Available at https://huggingface.co/spaces/mirari/Whisper-Youtube.

Author information

Authors and Affiliations

University of La Rioja, Logroño, Spain
Mirari San Martín, Jónathan Heras & Gadea Mata

Corresponding author

Correspondence to Mirari San Martín.

Editor information

Editors and Affiliations

University of Cadiz, Cadiz, Spain
Bernabé Dorronsoro
University of Malaga, Malaga, Spain
Francisco Chicano
University of Luxembourg, Esch-sur-Alzette, Luxembourg
Gregoire Danoy
University of Lille, Lille, France
El-Ghazali Talbi

Cite this paper

Martín, M.S., Heras, J., Mata, G. (2023). Automatic Generation of Subtitles for Videos of the Government of La Rioja. In: Dorronsoro, B., Chicano, F., Danoy, G., Talbi, EG. (eds) Optimization and Learning. OLA 2023. Communications in Computer and Information Science, vol 1824. Springer, Cham. https://doi.org/10.1007/978-3-031-34020-8_30

DOI: https://doi.org/10.1007/978-3-031-34020-8_30
Published: 27 May 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34019-2
Online ISBN: 978-3-031-34020-8
eBook Packages: Computer Science, Computer Science (R0)
