Audio-visual speech recognition based on AAM parameter and phoneme analysis of visual feature

Authors:
Yuto Komai;Yasuo Ariki;Tetsuya Takiguchi
Affiliations:
Graduate School of System Informatics, Kobe University, Kobe, Hyogo, Japan;Graduate School of System Informatics, Kobe University, Kobe, Hyogo, Japan;Graduate School of System Informatics, Kobe University, Kobe, Hyogo, Japan
Venue:
PSIVT'11 Proceedings of the 5th Pacific Rim conference on Advances in Image and Video Technology - Volume Part I
Year:
2011

Abstract

Audio-visual speech recognition, which uses dynamic visual information from the lips together with audio information, has attracted growing attention in recent years as a technique for robust speech recognition in noisy environments. Because visual information plays a major role in audio-visual speech recognition, the choice of visual feature is a significant issue. For spoken word recognition, this paper proposes using as the visual feature the combined parameter (c parameter) extracted by an Active Appearance Model (AAM) applied to a face image that includes the lip area. The combined parameter carries both coordinate-value and intensity-value information. The proposed feature improved the recognition rate compared with conventional features such as DCT and the principal component score. Finally, we integrated the phoneme score from audio information and the viseme score from visual information with high accuracy.
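
As a rough illustration of how a single parameter vector can carry both coordinate and intensity information, the following is a minimal sketch of the standard AAM combined-parameter construction: PCA on landmark coordinates, PCA on pixel intensities, then a third PCA on the weighted concatenation of the two. It is not the authors' implementation. The function names, the component counts, the scalar shape weight, and the use of NumPy are all assumptions made for this example, and the inputs are assumed to be already aligned shapes and shape-normalized textures.

```python
import numpy as np


def pca(data, n_components):
    """Return (mean, basis); basis columns are the top principal axes."""
    mean = data.mean(axis=0)
    # SVD of the centred data: rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components].T


def fit_combined_model(shapes, textures, n_shape, n_texture, n_combined):
    """Fit a toy AAM-style combined model (hypothetical helper).

    shapes:   (n_samples, 2 * n_landmarks) aligned landmark coordinates
    textures: (n_samples, n_pixels) shape-normalized intensities
    """
    s_mean, p_s = pca(shapes, n_shape)
    g_mean, p_g = pca(textures, n_texture)
    b_s = (shapes - s_mean) @ p_s      # shape parameters
    b_g = (textures - g_mean) @ p_g    # texture parameters
    # Assumed scalar weight that puts coordinates and intensities
    # on a comparable scale before they are concatenated.
    w_s = np.sqrt(b_g.var(axis=0).sum() / b_s.var(axis=0).sum())
    b = np.hstack([w_s * b_s, b_g])
    b_mean, q = pca(b, n_combined)     # third PCA gives the combined basis
    return {"s_mean": s_mean, "p_s": p_s, "g_mean": g_mean, "p_g": p_g,
            "w_s": w_s, "b_mean": b_mean, "q": q}


def combined_parameter(model, shape, texture):
    """Project one (shape, texture) pair onto the combined basis."""
    b_s = (shape - model["s_mean"]) @ model["p_s"]
    b_g = (texture - model["g_mean"]) @ model["p_g"]
    b = np.concatenate([model["w_s"] * b_s, b_g])
    return (b - model["b_mean"]) @ model["q"]
```

In a recognizer, a vector of this kind would be computed per video frame and used as the visual observation. How the model is fitted to each new frame (the AAM search) is outside the scope of this sketch.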