Evolving connectionist method for adaptive audiovisual speech recognition

Original Paper · Evolving Systems

Abstract

Reliability is the primary requirement in noisy conditions and for highly variable utterances. Integrating the recognition of visual signals with the recognition of audio signals is indispensable for many applications that require automatic speech recognition (ASR) in harsh conditions. Several important experiments have shown that integrating and adapting to multiple behavioral and contextual information sources during the speech-recognition task significantly improves its success rate. By integrating audio and visual speech data, we can improve an ASR system's performance by resolving the most critical cases of phonetic-unit mismatch, which occur when audio or visual input is processed alone. The evolving fuzzy neural-network (EFuNN) inference method is applied at the decision layer to accomplish this task, through a paradigm that adapts to the environment by changing its structure. The EFuNN's capacity to learn quickly from incoming data and to adapt online lowers the ASR system's complexity and enhances its performance in harsh conditions. Two independent feature extractors were developed, one for speech phonetics (listening to the speech) and the other for speech visemics (lip-reading the spoken input). The EFuNN was trained to fuse decisions made disjointly by the audio unit and the visual unit. Our experiments confirm that the proposed method is a reliable basis for a robust automatic speech-recognition system.
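
To make the decision-layer fusion concrete, the sketch below shows a minimal EFuNN-style learner in Python: inputs are fuzzified with triangular membership functions, a new rule node is evolved whenever no existing node matches the current example well enough, and otherwise the winning node's weights are refined in one pass. This is an illustrative simplification, not the authors' implementation; the thresholds (sthr, errthr), learning rates, and the two-score fusion inputs are assumptions made for the example.

```python
import numpy as np

class EFuNN:
    """Minimal one-pass EFuNN-style learner (illustrative sketch only).

    Simplified from Kasabov-style evolving fuzzy neural networks:
    triangular fuzzy membership on each input, rule nodes allocated
    on the fly, winner-node refinement. Parameter names are assumptions.
    """

    def __init__(self, n_mf=3, sthr=0.9, errthr=0.1, lr1=0.1, lr2=0.1):
        self.sthr = sthr              # sensitivity threshold: evolve node if no match
        self.errthr = errthr          # error threshold: evolve node if output is off
        self.lr1, self.lr2 = lr1, lr2
        self.centres = np.linspace(0.0, 1.0, n_mf)
        self.W1 = []                  # fuzzy-input prototype per rule node
        self.W2 = []                  # output weights per rule node

    def _fuzzify(self, x):
        """Triangular membership degrees for inputs scaled to [0, 1]."""
        width = self.centres[1] - self.centres[0]
        mf = 1.0 - np.abs(x[:, None] - self.centres[None, :]) / width
        return np.clip(mf, 0.0, 1.0).ravel()

    def _activations(self, f):
        """Rule-node activation: 1 - normalised fuzzy distance."""
        return np.array([1.0 - np.abs(f - w).sum() / (np.abs(f + w).sum() + 1e-9)
                         for w in self.W1])

    def learn_one(self, x, y):
        """One-pass update: refine the winning rule node or evolve a new one."""
        f = self._fuzzify(np.asarray(x, float))
        y = np.asarray(y, float)
        if self.W1:
            a = self._activations(f)
            j = int(np.argmax(a))
            y_hat = self.W2[j] * a[j]
            if a[j] >= self.sthr and np.abs(y - y_hat).max() <= self.errthr:
                self.W1[j] += self.lr1 * (f - self.W1[j])   # refine winner
                self.W2[j] += self.lr2 * (y - y_hat)
                return
        self.W1.append(f.copy())                            # evolve new rule node
        self.W2.append(y.copy())

    def predict(self, x):
        """Output of the best-matching rule node (call after learning)."""
        a = self._activations(self._fuzzify(np.asarray(x, float)))
        j = int(np.argmax(a))
        return self.W2[j] * a[j]

# Hypothetical fusion use: inputs = [audio score, visual score] for a
# candidate phonetic unit; output = one-hot class decision.
net = EFuNN()
net.learn_one([0.9, 0.8], [1.0, 0.0, 0.0])   # both units agree -> class 0
net.learn_one([0.2, 0.9], [0.0, 1.0, 0.0])   # units disagree   -> class 1
print(net.predict([0.85, 0.75]))             # close to the first example
```

The point of the sketch is the structural adaptation named in the abstract: because rule nodes are created and tuned as data arrives, the fusion rule between the audio and visual units can keep adjusting online as acoustic conditions degrade, without retraining a fixed-topology network.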


References

  • Badura S, Frátrik M, Škvarek O, Klimo M (2014) Bimodal vowel recognition using fuzzy logic networks - naive approach. ELEKTRO, 2014, pp. 22–25, IEEE

  • Basu S, Neti C, Senior A, Rajput N, Subramaniam A, Verma A (1999) Audio-visual large vocabulary continuous speech recognition in the broadcast domain. In: IEEE Workshop on Multimedia Signal Processing. pp. 475–481

  • Benoît C, Guiard-Marigny T, Le Goff B, Adjoudani A (1996) Which components of the face do humans and machines best speechread? In: Stork DG, Hennecke ME (eds) speechreading by humans and machines: models, systems, and applications. Springer-Verlag, New York, pp 315–328

    Chapter  Google Scholar 

  • Bernstein LE, Auer ET Jr (1996) Word Recognition in Speechreading. In: Stork DG, Hennecke ME (eds) In speechreading by humans and machines: models, systems, and applications. Springer-Verlag, New York, pp 17–26

    Chapter  Google Scholar 

  • Cappelletta L, Harte N (2012) Phoneme-to viseme mapping for visual speech recognition. Proceeding of the 2012 International Conference on Pattern Recognition Applications and Methods (ICPRAM 2012). (2012)

  • Dupont S, Luettin J (2000) Audio-visual speech modeling for continuous speech recognition. IEEE Trans Multimedia 2(3):141–151

    Article  Google Scholar 

  • http://www.kedri.aut.ac.nz. Accessed 6 July 2016

  • Joo MG (2003) A method of converting conventional fuzzy logic system to 2 layered hierarchical fuzzy system. In: Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 1357–1362

  • Kasabov N (1998) Evolving fuzzy neural networks—algorithms, applications and biological motivation. In: Yamakawa and Matsumoto (eds), Methodologies for the conception, design and application of soft computing, World Computing, pp. 271–274

  • Kasabov N (2001) Evolving fuzzy neural networks for online, adaptive, knowledge-based learning. IEEE Trans Syst Man Cybern B 31(6):902–918

    Article  Google Scholar 

  • Kasabov N, Postma K, Van den Herik EJ (2000) AVIS: a connectionist-based framework for integrated auditory and visual information processing. Info Sci J 123(1–2):127–148

    Article  MATH  Google Scholar 

  • Kaucic R, Dalton R, Blake A (1996) Real-time lip tracking for audio-visual speech recognition applications. Proc Eur Conf Comput Vision II:376–387

    Google Scholar 

  • Kölsch M, Bane R, Höllerer T, Turk M (2006) Touching the visualized invisible: wearable AR with a multimodal interface. IEEE Computer Graphics and Applications, May/June 2006

  • Malcangi M, de Tintis R (2004) Audio based real-time speech animation of embodied conversational agents. In: Lecture Notes in Computer Science, Vol. 2915, Gesture-based communication in human-computer interaction, pp. 429–430

  • Malcangi M, Ouazzane K, Patel P (2013) Audio-visual fuzzy fusion for robust speech recognition. In: The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 582–589

  • Marshall J, Tennent P (2013) “Mobile interaction does not exist” in chi ‘13 extended abstracts on human factors in computing systems (CHI EA ‘13). ACM, New York, pp 2069–2078

    Google Scholar 

  • Massaro DW (1996) Bimodal Speech Perception: A Progress Report. In: Stork DG, Hennecke ME (eds) In speechreading by humans and machines: models, systems, and applications. Springer-Verlag, New York, pp 79–101

    Chapter  Google Scholar 

  • McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748

    Article  Google Scholar 

  • Noda K, Yamaguchi Y, Hiroshi K, Okuno G, Ogata T (2015a) Audio-visual speech recognition using deep learning. Appl Intell Springer 42(4):722–737

    Article  Google Scholar 

  • Noda K, Yamaguchi Y, Nakadai K, Okuno HG, Ogata T (2015b) An asynchronous DBN for audio-visual speech recognition. Applied Intelligence 42(4):722–737

    Article  Google Scholar 

  • Patel P, Ouazzane K, Whitrow R (2005) Automated visual feature extraction for bimodal speech recognition. In: Proceedings of IADAT-micv2005. pp. 118–122

  • Petajan ED (1984) Automatic lipreading to enhance speech recognition. In: IEEE Communication Society Global Telecommunications Conference

  • Salama ES, El-khoribi RA, Shoman ME (2014) Audio-visual speech recognition for people with speech disorders. Int J Comput Appl 96(2):51–56

  • Sinclair S, Watson C (1995) The development of the Otago speech database. In: Kasabov K, Coghill G (eds). Proceedings of ANNES’95, IEEE Computer Society Press

  • Steifelhagen R, Yang J, Meier U (1997) Real time lip tracking for lipreading. In: Proceedings of Eurospeech

  • Stork DG, Wolff GJ, Levine EP (1992) Neural network lipreading system for improved speech recognition. In Proceedings International Joint Conf. on Neural Networks, vol. 2, pp. 289–295

  • Watts MJ (2009) A decade of Kasabov’s Evolving Connectionist Systems: a Review. IEEE Transactions on Systems, Man and Cybernetics Part C. Applications and Reviews (2009) 39(3):253–269

    MathSciNet  Google Scholar 

  • Watts M, Kasabov N (2000) Simple evolving connectionist systems and experiments on isolated phoneme recognition. In: Proceedings of the first IEEE conference on evolutionary computation and neural networks, San Antonio, pp. 232–239, IEEE Press

  • Wright D, Wareham G (2005) Mixing sound and vision: the interaction of auditory and visual information for earwitnesses of a crime scene. Legal Criminol Psychol 10(1):103–108

    Article  Google Scholar 

  • Yang J, Waibel A (1996) A real-time face tracker. In: Proc WACV. pp. 142–147

Download references

Acknowledgments

The audio and video hardware and software data-acquisition setup was provided by STMicroelectronics. Special acknowledgement is due to Dr. Claudio Marchisio for his expertise on the visual components of the system and to Dr. Roberto Sannino for his expertise on the audio components.

Author information

Correspondence to Mario Malcangi.


Cite this article

Malcangi, M., Grew, P. Evolving connectionist method for adaptive audiovisual speech recognition. Evolving Systems 8, 85–94 (2017). https://doi.org/10.1007/s12530-016-9156-6

