A Case Study of Audio Alignment for Multimedia Language Learning: Applications of SRGS and EMMA in Colibro Publishing

Multimodal Interaction with W3C Standards

Abstract

The synchronization of read-aloud audio and text in language learning is a powerful reinforcement for learners at all levels. To provide this kind of synchronized media experience, audio must be aligned with the text so that the correct audio plays while the related text is being presented or highlighted. One way to align text and audio is to do it manually in an audio editor, but this is time-consuming, expensive, and error-prone. A much faster and less expensive alternative is automatic alignment through speech recognition. Since the text and the matching audio are known ahead of time, the speech recognizer can perform this task with a very low error rate. Accuracy is further enhanced because read-aloud stories are typically recorded with careful speech at a lower words-per-minute rate than is typical of conversational speech. In Colibro Publishing’s approach, a Speech Recognition Grammar Specification (SRGS) grammar is generated from the text and provided to a speech recognizer, which then generates Extensible Multimodal Annotation (EMMA) output with the exact audio timestamps for the beginning and end points of each sentence. The alignment is then used in the interactive story production process so that the correct audio is played with highlighted text.
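The grammar-generation step can be illustrated with a minimal sketch. It is not Colibro Publishing's implementation: the preview does not show how the grammar is structured, so the layout here (one rule per sentence, chained in order from a root rule) is an assumption, as are the names `split_sentences` and `build_srgs`, the rule ids `story` and `s1`, `s2`, ..., and the naive regex sentence splitter. Only the SRGS namespace and the `grammar`, `rule`, and `ruleref` elements come from the SRGS 1.0 specification.

```python
import re
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_srgs(text, lang="en-US"):
    """Return an SRGS XML grammar that accepts the sentences of `text` in order."""
    ET.register_namespace("", SRGS_NS)  # serialize SRGS as the default namespace
    grammar = ET.Element(f"{{{SRGS_NS}}}grammar", {
        "version": "1.0",
        "mode": "voice",
        "root": "story",  # hypothetical root rule id
        f"{{{XML_NS}}}lang": lang,
    })
    # The root rule references each sentence rule in reading order.
    root_rule = ET.SubElement(
        grammar, f"{{{SRGS_NS}}}rule", {"id": "story", "scope": "public"})
    for i, sentence in enumerate(split_sentences(text), start=1):
        rule_id = f"s{i}"
        ET.SubElement(root_rule, f"{{{SRGS_NS}}}ruleref", {"uri": f"#{rule_id}"})
        rule = ET.SubElement(grammar, f"{{{SRGS_NS}}}rule", {"id": rule_id})
        # Drop punctuation so the rule contains only the spoken tokens.
        rule.text = re.sub(r"[^\w\s']", "", sentence)
    return ET.tostring(grammar, encoding="unicode")


if __name__ == "__main__":
    print(build_srgs("The cat sat. The dog ran!"))
```

Because the grammar admits only the known text, the recognizer's job reduces to locating sentence boundaries in the audio, which is consistent with the low error rate the abstract reports.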

Notes

  1. In fact, the speech recognition in our application could occur many years after the original speech. This might happen, for example, if we wanted to align an historic speech with its transcription. In that case, the standard “emma:start” and “emma:end” timestamps would be very different from the processing time, since they refer to the start and end of speech.
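Because EMMA defines “emma:start” and “emma:end” as absolute times (milliseconds since 1 January 1970 GMT), a production tool would typically convert them to offsets into the audio file before using them for highlighting. The following is a minimal sketch under stated assumptions, not the chapter's code: the `SAMPLE` document is invented for illustration, the function name `sentence_offsets` and the `audio_start_ms` parameter are hypothetical, and the audio is assumed to begin at the earliest “emma:start” unless an origin is given.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Invented sample output (assumption): one interpretation per sentence.
SAMPLE = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:sequence id="story">
    <emma:interpretation id="s1" emma:start="1452000000000"
        emma:end="1452000002400" emma:tokens="the cat sat"/>
    <emma:interpretation id="s2" emma:start="1452000002900"
        emma:end="1452000005100" emma:tokens="the dog ran"/>
  </emma:sequence>
</emma:emma>"""


def sentence_offsets(emma_xml, audio_start_ms=None):
    """Return (id, start_seconds, end_seconds) per interpretation,
    measured from the start of the audio rather than from the epoch."""
    root = ET.fromstring(emma_xml)
    spans = [
        (
            interp.get("id"),
            int(interp.get(f"{{{EMMA_NS}}}start")),
            int(interp.get(f"{{{EMMA_NS}}}end")),
        )
        for interp in root.iter(f"{{{EMMA_NS}}}interpretation")
    ]
    # Assumption: if no origin is supplied, the audio starts at the first sentence.
    if audio_start_ms is None:
        audio_start_ms = min(start for _, start, _ in spans)
    return [
        (sid, (start - audio_start_ms) / 1000, (end - audio_start_ms) / 1000)
        for sid, start, end in spans
    ]


if __name__ == "__main__":
    for sid, start, end in sentence_offsets(SAMPLE):
        print(sid, start, end)
```

Working from differences between timestamps keeps the alignment independent of when recognition was run, which is the situation the note describes.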

Author information

Corresponding author

Correspondence to Deborah A. Dahl.

Copyright information

© 2017 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Dahl, D.A., Dooner, B. (2017). A Case Study of Audio Alignment for Multimedia Language Learning: Applications of SRGS and EMMA in Colibro Publishing. In: Dahl, D. (eds) Multimodal Interaction with W3C Standards. Springer, Cham. https://doi.org/10.1007/978-3-319-42816-1_14

  • DOI: https://doi.org/10.1007/978-3-319-42816-1_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-42814-7

  • Online ISBN: 978-3-319-42816-1

  • eBook Packages: Engineering, Engineering (R0)
