Automatic Generation of Subtitles for Videos of the Government of La Rioja

  • Conference paper
Optimization and Learning (OLA 2023)

Abstract

Public institutions now routinely publish videos containing important information on their webpages. However, people with hearing impairments have difficulty accessing content delivered in this form, and manually transcribing those videos is time-consuming. Automatic Speech Recognition (ASR) systems can address this problem. In this work, we evaluate the performance of several ASR systems on videos from the Government of La Rioja, Spain. Our study shows that the Whisper medium model provides the best trade-off between accuracy and speed. Using this model, we have transcribed all the videos on the YouTube channel of the Government of La Rioja. In addition, we have created a tool that facilitates this task for other Spanish YouTube channels. This work is therefore a step towards improving the accessibility of the information and content produced by Spanish public administrations.
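
The abstract does not state which accuracy metric was used to compare the ASR systems. A common choice for this kind of evaluation is the word error rate (WER): the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch under that assumption; the function name `word_error_rate` and the lowercase, whitespace-based tokenisation are illustrative choices, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is lowercased and split on whitespace; a real
    evaluation would likely also normalise punctuation and numbers.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of five reference words.
    print(word_error_rate("el gobierno de la rioja", "el gobierno de rioja"))
```

A lower value means a more accurate transcription. Since insertions are counted, the value can exceed 1 when the output is much longer than the reference.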

This work was partially supported by Ministerio de Ciencia e Innovación [PID2020-115225RB-I00 / AEI / 10.13039/501100011033], by OTRI OTCA221110, and by a regional project of the Government of La Rioja.

Notes

  1. https://www.youtube.com/@GobiernoDeLaRiojaES.

  2. On July 28th, 2022.

  3. The transcriptions of the videos are available at https://github.com/mirenmirari/subtitulos_canalgobierno.

  4. Available at https://huggingface.co/spaces/mirari/Whisper-Youtube.

Author information

Corresponding author

Correspondence to Mirari San Martín.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

San Martín, M., Heras, J., Mata, G. (2023). Automatic Generation of Subtitles for Videos of the Government of La Rioja. In: Dorronsoro, B., Chicano, F., Danoy, G., Talbi, E.-G. (eds) Optimization and Learning. OLA 2023. Communications in Computer and Information Science, vol 1824. Springer, Cham. https://doi.org/10.1007/978-3-031-34020-8_30

  • DOI: https://doi.org/10.1007/978-3-031-34020-8_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34019-2

  • Online ISBN: 978-3-031-34020-8

  • eBook Packages: Computer Science, Computer Science (R0)
