Abstract
In previous chapters, we have seen how various silent speech interface (SSI) modalities gather information about the different stages of speech production, covering brain and muscular activity, articulation, acoustics, and visual speech features. In this chapter, the reader is introduced to the combination of different modalities, not only to drive silent speech interfaces, but also to deepen the understanding of emerging and promising modalities such as ultrasonic Doppler sensing. This approach poses several challenges in the acquisition, synchronization, processing, and analysis of multimodal data. These challenges lead the authors to propose a framework to support research on multimodal silent speech interfaces (SSIs) and to provide concrete examples of its practical application, considering several of the SSI modalities covered in previous chapters. For each example, we propose baseline methods for comparison on the collected data.
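One of the synchronization challenges mentioned above is that the different modalities (e.g., surface EMG, ultrasonic Doppler, video) are sampled at very different rates, so their streams must be aligned onto a common time base before joint processing. A minimal sketch of one common approach, nearest-timestamp alignment of a sparse stream onto a dense reference stream, is shown below; the sampling rates and feature values are illustrative assumptions, not taken from the chapter, and a real acquisition setup would also need to estimate clock offsets between devices.

```python
import numpy as np

def align_streams(t_ref, t_other, x_other):
    """For each reference timestamp, pick the sample of the other
    modality whose timestamp is nearest (nearest-neighbour alignment).
    Assumes t_other is sorted in increasing order."""
    # Insertion points of the reference timestamps into the other stream
    idx = np.searchsorted(t_other, t_ref)
    idx = np.clip(idx, 1, len(t_other) - 1)
    # Choose whichever of the two neighbouring samples is closer in time
    left_closer = (t_ref - t_other[idx - 1]) < (t_other[idx] - t_ref)
    idx = np.where(left_closer, idx - 1, idx)
    return x_other[idx]

# Illustrative example: EMG at 1000 Hz, video features at 30 fps, 1 s of data
t_emg = np.arange(0, 1.0, 1 / 1000)      # dense reference stream
t_vid = np.arange(0, 1.0, 1 / 30)        # sparse stream to be aligned
vid_feat = np.sin(2 * np.pi * t_vid)     # placeholder visual features
aligned = align_streams(t_emg, t_vid, vid_feat)
assert aligned.shape == t_emg.shape      # one visual sample per EMG sample
```

After this step, every EMG frame carries the temporally closest visual feature, so frame-level multimodal classifiers can be trained on paired samples.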
Copyright information
© 2017 The Author(s)
Cite this chapter
Freitas, J., Teixeira, A., Dias, M.S., Silva, S. (2017). Combining Modalities: Multimodal SSI. In: An Introduction to Silent Speech Interfaces. SpringerBriefs in Electrical and Computer Engineering(). Springer, Cham. https://doi.org/10.1007/978-3-319-40174-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40173-7
Online ISBN: 978-3-319-40174-4
eBook Packages: Engineering (R0)