Abstract
Reliability is the primary requirement for automatic speech recognition (ASR) in noisy conditions and for highly variable utterances. Integrating the recognition of visual signals with the recognition of audio signals is therefore indispensable for many applications that must perform ASR in harsh conditions. Several important experiments have shown that integrating, and adapting to, multiple behavioral and contextual information sources during the speech-recognition task significantly improves its success rate. By fusing audio and visual speech data, an ASR system can resolve the most critical cases of phonetic-unit mismatch that arise when audio or visual input is processed alone. The evolving fuzzy neural-network (EFuNN) inference method is applied at the decision layer to accomplish this task, through a paradigm that adapts to the environment by changing its structure. The EFuNN’s capacity to learn quickly from incoming data and to adapt online lowers the ASR system’s complexity and enhances its performance in harsh conditions. Two independent feature extractors were developed: one for speech phonetics (listening to the speech) and the other for speech visemics (lip-reading the spoken input). The EFuNN was trained to fuse the decisions made disjointly by the audio unit and the visual unit. Our experiments confirm that the proposed method is suitable for developing a robust ASR system.
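The abstract gives no code, so the following is only a minimal sketch of the decision-level fusion idea it describes: each unimodal recognizer emits a per-class score vector, and an evolving rule layer in the spirit of Kasabov's EFuNN creates or adapts rule nodes in one pass over the incoming examples. The class name `SimpleEvolvingFusion`, its parameters, and the thresholds are illustrative assumptions, not the paper's implementation; the full EFuNN additionally has fuzzy membership and aggregation layers, node pruning, and rule extraction.

```python
import numpy as np

class SimpleEvolvingFusion:
    """EFuNN-style evolving layer used for decision fusion (sketch only).

    The input is the concatenation of the audio unit's and the visual
    unit's per-class scores; rule nodes are created when no existing
    node is close enough, otherwise the winning node is adapted.
    """

    def __init__(self, n_classes, sthr=0.5, lr1=0.1, lr2=0.1):
        self.n_classes = n_classes
        self.sthr = sthr                 # sensitivity threshold for node creation
        self.lr1, self.lr2 = lr1, lr2    # learning rates for centers / outputs
        self.W1 = []                     # rule-node centers in the input space
        self.W2 = []                     # rule-node class-weight vectors

    def _distance(self, x, w):
        # Normalized mean absolute difference between input and rule node.
        return np.abs(x - w).sum() / len(x)

    def partial_fit(self, audio_scores, visual_scores, label):
        """One-pass update with a single labeled example."""
        x = np.concatenate([audio_scores, visual_scores])
        y = np.zeros(self.n_classes)
        y[label] = 1.0
        if not self.W1:                          # first example: seed a node
            self.W1.append(x.copy()); self.W2.append(y)
            return
        d = [self._distance(x, w) for w in self.W1]
        j = int(np.argmin(d))
        if d[j] > self.sthr:                     # no node close enough: evolve one
            self.W1.append(x.copy()); self.W2.append(y)
        else:                                    # adapt the winning rule node
            self.W1[j] += self.lr1 * (x - self.W1[j])
            self.W2[j] += self.lr2 * (y - self.W2[j])

    def predict(self, audio_scores, visual_scores):
        """Fuse the two unimodal decisions (assumes at least one node exists)."""
        x = np.concatenate([audio_scores, visual_scores])
        d = [self._distance(x, w) for w in self.W1]
        j = int(np.argmin(d))                    # one-of-n winning activation
        return int(np.argmax(self.W2[j]))

if __name__ == "__main__":
    # Synthetic demo: 5 phoneme classes, noisy unimodal score vectors.
    rng = np.random.default_rng(0)
    net = SimpleEvolvingFusion(n_classes=5)
    for _ in range(200):
        label = int(rng.integers(5))
        audio = rng.random(5);  audio[label] += 1.0
        visual = rng.random(5); visual[label] += 1.0
        net.partial_fit(audio, visual, label)
    print("rule nodes evolved:", len(net.W1))
    print("fused decision:", net.predict(audio, visual))
```

Because nodes are added only when incoming data falls outside every existing rule node's sensitivity region, the structure grows with the variability of the environment rather than being fixed in advance, which is the online-adaptation property the abstract attributes to the EFuNN.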
Acknowledgments
The audio and video hardware and software data-acquisition setup was provided by STMicroelectronics. Special acknowledgment is due to Dr. Claudio Marchisio for his expertise on the visual components of the system and to Dr. Roberto Sannino for his expertise on the audio components.
Cite this article
Malcangi, M., Grew, P. Evolving connectionist method for adaptive audiovisual speech recognition. Evolving Systems 8, 85–94 (2017). https://doi.org/10.1007/s12530-016-9156-6