Abstract
Synchronizing read-aloud audio with text is a powerful reinforcement for language learners at all levels. To provide this kind of synchronized media experience, the audio must be aligned with the text so that the correct audio plays while the related text is being presented or highlighted. One way to align text and audio is a manual process using an audio editor, but this is time-consuming, expensive, and error-prone. A much faster and less expensive alternative is automatic alignment using speech recognition. Since the text and the matching audio are known ahead of time, the speech recognizer can perform this task with a very low error rate. Accuracy is further improved because read-aloud stories are typically recorded with careful speech at a lower words-per-minute rate than is typical of conversational speech. In Colibro Publishing’s approach, a Speech Recognition Grammar Specification (SRGS) grammar is generated from the text and provided to a speech recognizer, which then generates Extensible Multimodal Annotation (EMMA) output with the exact audio timestamps for the beginning and end points of each sentence. The alignment is then used in the interactive story production process so that the correct audio is played with highlighted text.
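The grammar-generation step described above can be illustrated with a minimal Python sketch. It is not the chapter's actual implementation: the one-rule-per-sentence layout, the rule ids (`story`, `s1`, `s2`, ...), the punctuation stripping, and the function name `build_srgs_grammar` are all assumptions made for illustration. Only the SRGS namespace and the `grammar`, `rule`, and `ruleref` elements come from the W3C specification.

```python
import re
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"      # SRGS 1.0 namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"    # needed for xml:lang


def build_srgs_grammar(sentences, lang="en-US"):
    """Return SRGS XML whose root rule matches the sentences in story order.

    Hypothetical layout: one rule per sentence, referenced in sequence
    from a public root rule named "story".
    """
    ET.register_namespace("", SRGS_NS)  # serialize SRGS as the default namespace
    grammar = ET.Element(
        f"{{{SRGS_NS}}}grammar",
        {"version": "1.0", "mode": "voice", "root": "story",
         f"{{{XML_NS}}}lang": lang},
    )
    story = ET.SubElement(
        grammar, f"{{{SRGS_NS}}}rule", {"id": "story", "scope": "public"}
    )
    for index, sentence in enumerate(sentences, start=1):
        rule_id = f"s{index}"
        # The root rule references each sentence rule in reading order.
        ET.SubElement(story, f"{{{SRGS_NS}}}ruleref", {"uri": f"#{rule_id}"})
        rule = ET.SubElement(grammar, f"{{{SRGS_NS}}}rule", {"id": rule_id})
        # Assumption: punctuation is dropped so that only spoken tokens remain.
        rule.text = re.sub(r"[^\w\s']", "", sentence).strip()
    return ET.tostring(grammar, encoding="unicode")


if __name__ == "__main__":
    print(build_srgs_grammar(["The cat sat on the mat.", "It purred."]))
```

Because the grammar admits only the known text in its known order, the recognizer has very little to decide beyond where each sentence begins and ends, which is consistent with the low error rate noted above.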
Notes
1. In fact, the speech recognition in our application could occur many years after the original speech. This might happen, for example, if we wanted to align an historic speech with its transcription. In that case, the standard “emma:start” and “emma:end” timestamps would be very different from the processing time, since they refer to the start and end of speech.
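As a rough illustration of how such timestamps might be consumed, the Python sketch below reads “emma:start” and “emma:end” (absolute times in milliseconds, per the EMMA specification) and converts them to offsets in seconds from the start of the recording. The sample document, the `audio_start_ms` parameter, and the function name `sentence_offsets` are hypothetical; a real recognizer's EMMA output may be structured differently.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA 1.0 namespace

# Hypothetical EMMA output: one interpretation per aligned sentence.
SAMPLE_EMMA = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:sequence id="story">
    <emma:interpretation id="s1" emma:start="1000000000500"
                         emma:end="1000000002900">
      <text>the cat sat on the mat</text>
    </emma:interpretation>
    <emma:interpretation id="s2" emma:start="1000000003400"
                         emma:end="1000000004800">
      <text>it purred</text>
    </emma:interpretation>
  </emma:sequence>
</emma:emma>
"""


def sentence_offsets(emma_xml, audio_start_ms):
    """Return (id, start_s, end_s) per interpretation, relative to the audio start.

    audio_start_ms is an assumed, externally known absolute time (in
    milliseconds) at which the recording began.
    """
    root = ET.fromstring(emma_xml)
    offsets = []
    for interp in root.iter(f"{{{EMMA_NS}}}interpretation"):
        start_ms = int(interp.get(f"{{{EMMA_NS}}}start"))
        end_ms = int(interp.get(f"{{{EMMA_NS}}}end"))
        offsets.append((
            interp.get("id"),
            (start_ms - audio_start_ms) / 1000,
            (end_ms - audio_start_ms) / 1000,
        ))
    return offsets


if __name__ == "__main__":
    for sentence_id, start_s, end_s in sentence_offsets(SAMPLE_EMMA, 1000000000000):
        print(sentence_id, start_s, end_s)
```

Subtracting a recording start time keeps the offsets meaningful however long after the original speech the recognition is run, which is the situation this note describes.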
Copyright information
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Dahl, D.A., Dooner, B. (2017). A Case Study of Audio Alignment for Multimedia Language Learning: Applications of SRGS and EMMA in Colibro Publishing. In: Dahl, D. (eds) Multimodal Interaction with W3C Standards. Springer, Cham. https://doi.org/10.1007/978-3-319-42816-1_14
DOI: https://doi.org/10.1007/978-3-319-42816-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42814-7
Online ISBN: 978-3-319-42816-1
eBook Packages: Engineering, Engineering (R0)