Abstract
In this paper, we propose a set of automatic stress exaggeration methods that enlarge the differences between stressed and unstressed syllables. These methods can be used in computer-aided language learning systems to help second language learners perceive stress patterns. They are intended to support hyper-pronunciation training, which teachers commonly use in classrooms. In hyper-pronunciation training, exaggeration helps learners become more aware of acoustic features and apply these features effectively in their own pronunciation. Duration, pitch and intensity have been claimed to be the main acoustic features closely related to stress in English. Thus, four stress exaggeration methods are proposed in this paper: (i) duration-based stress exaggeration, (ii) pitch-based stress exaggeration, (iii) intensity-based stress exaggeration, and (iv) a combined stress exaggeration method that integrates the duration-based, pitch-based and intensity-based methods. Results of our perceptual experiments show that stimuli resynthesised by the proposed stress exaggeration methods help learners of English as a Second Language (ESL) perceive English stress patterns significantly better.
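To make the idea of intensity-based exaggeration concrete, the following is a minimal sketch and not the authors' implementation. It assumes that syllable boundaries and stress labels are already known, and it simply raises the level of stressed syllables and lowers the level of unstressed ones. The `Syllable` type, the function name `exaggerate_intensity`, and the default gains in decibels are hypothetical choices made for illustration only. Duration and pitch exaggeration would need a resynthesis step and are not shown.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Syllable:
    """Hypothetical syllable annotation over a sample buffer."""
    start: int       # first sample index (inclusive)
    end: int         # last sample index (exclusive)
    stressed: bool   # True if the syllable carries stress


def exaggerate_intensity(
    samples: Sequence[float],
    syllables: Sequence[Syllable],
    boost_db: float = 6.0,   # assumed gain for stressed syllables
    cut_db: float = 3.0,     # assumed attenuation for unstressed syllables
) -> List[float]:
    """Return a copy of `samples` with a larger stressed/unstressed level gap."""
    out = list(samples)
    boost = 10 ** (boost_db / 20)   # dB to linear amplitude factor
    cut = 10 ** (-cut_db / 20)
    for syl in syllables:
        gain = boost if syl.stressed else cut
        for i in range(syl.start, syl.end):
            out[i] = out[i] * gain
    return out


if __name__ == "__main__":
    signal = [0.1, 0.1, 0.1, 0.1]
    labels = [Syllable(0, 2, True), Syllable(2, 4, False)]
    print(exaggerate_intensity(signal, labels))
```

Samples outside any labelled syllable are left unchanged. The sketch applies abrupt gain steps at syllable boundaries and does not guard against clipping; a practical system would likely smooth the gain contour and limit the output level.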
Cite this article
Lu, J., Wang, R. & De Silva, L.C. Automatic stress exaggeration by prosody modification to assist language learners perceive sentence stress. Int J Speech Technol 15, 87–98 (2012). https://doi.org/10.1007/s10772-011-9124-2
DOI: https://doi.org/10.1007/s10772-011-9124-2