Abstract
Mainstream automatic speech recognition has focused almost exclusively on the acoustic signal. The performance of such systems degrades considerably in the real world in the presence of noise. Novel approaches are needed that use sources of information orthogonal to the acoustic input, which not only considerably improve performance in severely degraded conditions but are also independent of the type of noise and reverberation. Visual speech is one such source, as it is not perturbed by the acoustic environment or noise. This paper presents the author's approach to lip tracking and to the fusion of audio and video signals for an audio-visual speech and speaker recognition system. It describes video analysis of visual speech for extracting visual features from a talking person in color video sequences. A method was developed for automatic localization of the face, the eyes, the mouth region, and the corners and contour of the mouth. A synchronous method and two asynchronous methods for fusing the audio and video signals are proposed. Finally, the paper reports lip-tracking results under various conditions (lighting, beard) and results of speech and speaker recognition in noisy environments.
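To make the idea of audio-video fusion concrete, the following is a minimal sketch and not the method of the paper, whose details are not given in the abstract. It assumes a generic feature-level (frame-synchronous) scheme, in which the slower video feature stream is repeated to match the audio frame count and the two vectors are concatenated, and a generic decision-level scheme, in which per-word log-likelihoods from separate audio and video models are combined with a weight. All function names, the nearest-frame alignment, and the weighting rule are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def align_video_to_audio(video: Sequence[Sequence[float]],
                         n_audio_frames: int) -> List[Sequence[float]]:
    """Repeat video frames so the stream has n_audio_frames entries.

    Assumption: simple index scaling stands in for a real
    interpolation scheme between the two frame rates.
    """
    if not video:
        raise ValueError("video stream is empty")
    n = len(video)
    return [video[min(n - 1, (i * n) // n_audio_frames)]
            for i in range(n_audio_frames)]


def fuse_features(audio: Sequence[Sequence[float]],
                  video: Sequence[Sequence[float]]) -> List[List[float]]:
    """Feature-level fusion: one joint vector per audio frame."""
    aligned = align_video_to_audio(video, len(audio))
    return [list(a) + list(v) for a, v in zip(audio, aligned)]


def fuse_scores(audio_ll: Dict[str, float],
                video_ll: Dict[str, float],
                audio_weight: float) -> Dict[str, float]:
    """Decision-level fusion: weighted sum of per-word log-likelihoods.

    audio_weight is a hypothetical reliability parameter in [0, 1];
    it would be lowered as the acoustic noise level rises.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be in [0, 1]")
    return {word: audio_weight * audio_ll[word]
            + (1.0 - audio_weight) * video_ll[word]
            for word in audio_ll}


def best_word(scores: Dict[str, float]) -> str:
    """Return the word with the highest combined score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    audio = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    video = [[0.1], [0.2]]
    print(fuse_features(audio, video))

    audio_ll = {"tak": -10.0, "nie": -12.0}
    video_ll = {"tak": -8.0, "nie": -4.0}
    print(best_word(fuse_scores(audio_ll, video_ll, 0.5)))
```

In this sketch the joint vectors from `fuse_features` would feed a single recognizer, whereas `fuse_scores` keeps the two modalities separate until the final decision, so the relative trust placed in each stream can change with the noise level.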
Copyright information
© 2006 Springer Science+Business Media, LLC
About this paper
Cite this paper
Kubanek, M. (2006). Method of Speech Recognition and Speaker Identification using Audio-Visual of Polish Speech and Hidden Markov Models. In: Saeed, K., Pejaś, J., Mosdorf, R. (eds) Biometrics, Computer Security Systems and Artificial Intelligence Applications. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-36503-9_5
DOI: https://doi.org/10.1007/978-0-387-36503-9_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-36232-8
Online ISBN: 978-0-387-36503-9
eBook Packages: Computer Science (R0)