We present a multimodal open-set speaker identification system that integrates information from audio, face, and lip-motion modalities. To fuse the modalities, we propose a new adaptive cascade rule that favors reliable modality combinations through a cascade of classifiers, whose order is adaptively determined from the reliability of each modality combination. We also propose a novel reliability measure, genuinely suited to the open-set speaker identification problem, to assess a classifier's accept-or-reject decisions. The proposed adaptive rule is more robust in the presence of unreliable modalities, and it outperforms the hard-level max rule and the soft-level weighted summation rule, provided that the employed reliability measure assesses classifier decisions effectively. Experimental results supporting this assertion are provided.
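The adaptive cascade idea in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the classifier names, the `estimate_reliability` helper, and the fixed accept threshold are all hypothetical stand-ins for the paper's modality-combination reliability measure and accept/reject test.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class ModalityClassifier:
    # One classifier per modality combination (e.g. audio, face, audio+lip).
    name: str
    # Maps an observation to (candidate speaker id, confidence score).
    classify: Callable[[dict], Tuple[str, float]]

def estimate_reliability(name: str, observation: dict) -> float:
    # Placeholder reliability measure: the paper proposes a measure tailored
    # to open-set identification; here we simply read a precomputed
    # per-modality quality value attached to the observation.
    return observation.get("quality", {}).get(name, 0.0)

def adaptive_cascade(classifiers: List[ModalityClassifier],
                     observation: dict,
                     accept_threshold: float = 0.5) -> Optional[str]:
    # Adaptively order the cascade: most reliable modality combination first.
    ordered = sorted(classifiers,
                     key=lambda c: estimate_reliability(c.name, observation),
                     reverse=True)
    # Walk the cascade; stop at the first accepted decision.
    for clf in ordered:
        speaker, confidence = clf.classify(observation)
        if confidence >= accept_threshold:
            return speaker  # accept: speaker identified
    return None  # every classifier rejected: open-set reject (unknown speaker)
```

For example, if the audio channel is judged more reliable than the face channel for a given utterance, the audio classifier runs first and, if its decision is accepted, the less reliable modality is never consulted.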
Cite as: Erzin, E., Yemez, Y., Tekalp, A.M. (2004) Adaptive classifier cascade for multimodal speaker identification. Proc. Interspeech 2004, 2493-2496, doi: 10.21437/Interspeech.2004-425
@inproceedings{erzin04_interspeech,
  author={Engin Erzin and Yucel Yemez and A. Murat Tekalp},
  title={{Adaptive classifier cascade for multimodal speaker identification}},
  year={2004},
  booktitle={Proc. Interspeech 2004},
  pages={2493--2496},
  doi={10.21437/Interspeech.2004-425}
}