A Case Study of Audio Alignment for Multimedia Language Learning: Applications of SRGS and EMMA in Colibro Publishing

Dahl, Deborah A.; Dooner, Brian

doi:10.1007/978-3-319-42816-1_14

Abstract

The synchronization of read-aloud audio and text in language learning is a powerful reinforcement for learners at all levels. To provide this kind of synchronized media experience, audio must be aligned with the text so that the correct audio plays while the related text is being presented or highlighted. One way to align text and audio is manually, using an audio editor, but this is time-consuming, expensive, and error-prone. A much faster and less expensive alternative is automatic alignment using speech recognition. Because both the text and the matching audio are known ahead of time, the speech recognizer can perform this task with a very low error rate. Accuracy is further improved because read-aloud stories are typically recorded with careful speech at a lower words-per-minute rate than conversational speech. In Colibro Publishing’s approach, a Speech Recognition Grammar Specification (SRGS) grammar is generated from the text and provided to a speech recognizer, which then generates Extensible Multimodal Annotation (EMMA) output with the exact audio timestamps for the beginning and end points of each sentence. The alignment is then used in the interactive story production process so that the correct audio is played with the highlighted text.
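The pipeline described above can be sketched in three steps: split the story into sentences, generate an SRGS grammar from them, and map the recognizer's EMMA output back to sentence timings. This is a minimal illustration under stated assumptions, not Colibro Publishing's implementation. The regex sentence splitter, the single `one-of` root rule, and the function names (`split_sentences`, `words`, `build_srgs`, `align`) are hypothetical. The sketch also assumes that the recognizer returns one `emma:interpretation` per sentence carrying the standard `emma:tokens`, `emma:start`, and `emma:end` annotations. How a particular recognizer segments the audio and reports its results may differ.

```python
import re
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
EMMA_NS = "http://www.w3.org/2003/04/emma"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", SRGS_NS)  # write SRGS elements without a prefix


def split_sentences(text):
    """Hypothetical naive splitter: break after . ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Lower-cased word tokens; punctuation is dropped since it is not spoken."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def build_srgs(sentences, lang="en-US"):
    """Return an SRGS XML grammar whose root rule accepts any one sentence."""
    grammar = ET.Element(f"{{{SRGS_NS}}}grammar", {
        "version": "1.0", "mode": "voice", "root": "story",
        f"{{{XML_NS}}}lang": lang,
    })
    rule = ET.SubElement(grammar, f"{{{SRGS_NS}}}rule",
                         {"id": "story", "scope": "public"})
    one_of = ET.SubElement(rule, f"{{{SRGS_NS}}}one-of")
    for sentence in sentences:
        item = ET.SubElement(one_of, f"{{{SRGS_NS}}}item")
        item.text = " ".join(words(sentence))
    return ET.tostring(grammar, encoding="unicode")


def align(sentences, emma_xml):
    """Return (sentence_index, start_ms, end_ms) for each matched interpretation.

    start_ms/end_ms are the absolute emma:start/emma:end values
    (milliseconds since 1 January 1970 GMT).
    """
    expected = [" ".join(words(s)) for s in sentences]
    next_idx, result = 0, []
    for interp in ET.fromstring(emma_xml).iter(f"{{{EMMA_NS}}}interpretation"):
        tokens = " ".join(words(interp.get(f"{{{EMMA_NS}}}tokens", "")))
        # Search forward in reading order so repeated lines (e.g. a refrain)
        # are assigned to successive occurrences.
        for i in range(next_idx, len(expected)):
            if expected[i] == tokens:
                result.append((i, int(interp.get(f"{{{EMMA_NS}}}start")),
                               int(interp.get(f"{{{EMMA_NS}}}end"))))
                next_idx = i + 1
                break
    return result


# Usage with a short story and a simplified, hypothetical EMMA result.
story = "The cat sat on the mat. It was happy!"
sentences = split_sentences(story)
print(build_srgs(sentences))

emma = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:sequence id="seq1">
    <emma:interpretation id="int1" emma:medium="acoustic" emma:mode="voice"
        emma:start="1450000000000" emma:end="1450000002100"
        emma:tokens="the cat sat on the mat"/>
    <emma:interpretation id="int2" emma:medium="acoustic" emma:mode="voice"
        emma:start="1450000002600" emma:end="1450000003900"
        emma:tokens="it was happy"/>
  </emma:sequence>
</emma:emma>"""
print(align(sentences, emma))
```

Matching interpretations to sentences by their tokens in reading order is only one simple design choice. With this approach, an interpretation whose tokens match no remaining sentence is skipped rather than shifting every later timestamp by one position.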

Notes

1.
In fact, the speech recognition in our application could take place many years after the original speech was recorded. This might happen, for example, if we wanted to align a historic speech with its transcription. In that case, the standard “emma:start” and “emma:end” timestamps would be very different from the processing time, since they refer to the start and end of the speech itself rather than to when it was processed.
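To make the distinction in this note concrete: EMMA defines emma:start and emma:end as absolute times in milliseconds since 1 January 1970 00:00:00 GMT. If the date and time of a recording are known, the timestamps for an aligned segment can be derived from the segment's offset into the audio file instead of from the time at which the recognizer ran. The following is a minimal sketch; the recording time, the offsets, and the helper names `emma_timestamps` and `offset_seconds` are hypothetical and are not part of any EMMA toolkit.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MS = timedelta(milliseconds=1)


def emma_timestamps(recording_start, offset_start_s, offset_end_s):
    """Absolute emma:start/emma:end values (ms since 1 Jan 1970 GMT) for a
    segment lying offset_start_s..offset_end_s seconds into a recording
    that began at recording_start (a timezone-aware datetime)."""
    start = recording_start + timedelta(seconds=offset_start_s)
    end = recording_start + timedelta(seconds=offset_end_s)
    return (start - EPOCH) // MS, (end - EPOCH) // MS


def offset_seconds(emma_ms, recording_start):
    """Inverse: position in the audio file (seconds) for an emma:start/end value."""
    return (emma_ms - (recording_start - EPOCH) // MS) / 1000


# Hypothetical recording made on 1 June 1985 at 12:00 UTC.
recording_start = datetime(1985, 6, 1, 12, 0, tzinfo=timezone.utc)
start_ms, end_ms = emma_timestamps(recording_start, 2.5, 7.25)
print(start_ms, end_ms)                            # 486475202500 486475207250
print(offset_seconds(start_ms, recording_start))   # 2.5
```

The timestamps are anchored to the time the speech was recorded. Processing time, such as `datetime.now()`, plays no part in them, which is the point the note makes.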

Author information

Authors and Affiliations

Conversational Technologies, Plymouth Meeting, PA, USA
Deborah A. Dahl
Colibro Publishing, Philadelphia, PA, USA
Brian Dooner

Corresponding author

Correspondence to Deborah A. Dahl.

Editor information

Editors and Affiliations

Conversational Technologies, Plymouth Meeting, Pennsylvania, USA
Deborah A. Dahl

About this chapter

Cite this chapter

Dahl, D.A., Dooner, B. (2017). A Case Study of Audio Alignment for Multimedia Language Learning: Applications of SRGS and EMMA in Colibro Publishing. In: Dahl, D. (eds) Multimodal Interaction with W3C Standards. Springer, Cham. https://doi.org/10.1007/978-3-319-42816-1_14

DOI: https://doi.org/10.1007/978-3-319-42816-1_14
Published: 18 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42814-7
Online ISBN: 978-3-319-42816-1
eBook Packages: Engineering, Engineering (R0)
