Multi-Modal Speech Recognition Using Optical-Flow Analysis for Lip Images

Authors:
Satoshi Tamura;Koji Iwano;Sadaoki Furui
Affiliations:
Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552, Japan;Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552, Japan;Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552, Japan
Venue:
Journal of VLSI Signal Processing Systems
Year:
2004

Abstract

This paper proposes a multi-modal speech recognition method using optical-flow analysis for lip images. Optical flow is defined as the distribution of apparent velocities in the movement of brightness patterns in an image. Since the optical flow is computed without extracting the speaker's lip contours and location, robust visual features can be obtained for lip movements. Our method calculates two kinds of visual feature sets in each frame. The first feature set consists of the variances of the vertical and horizontal components of the optical-flow vectors. These are useful for estimating silence/pause periods in noisy conditions since they represent movement of the speaker's mouth. The second feature set consists of the maximum and minimum values of the integral of the optical flow. These are expected to be more effective than the first set since they carry not only silence/pause information but also the open/closed status of the speaker's mouth. Each of the feature sets is combined with an acoustic feature set in the framework of HMM-based recognition. Triphone HMMs are trained using the combined parameter sets extracted from clean speech data. Noise-corrupted speech recognition experiments have been carried out using audio-visual data from 11 male speakers uttering connected digits. When the visual information was used only for the silence HMM, with the integral information of optical flow as the visual feature set, digit accuracy improved over the audio-only recognition scheme by 4% at SNR = 5 dB and 13% at SNR = 10 dB.
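
The two visual feature sets described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the optical-flow vectors for a frame have already been computed and are available as a list of (u, v) pairs, horizontal component first. The abstract does not say how the integral of the optical flow is taken, so the sketch assumes a running sum over a one-dimensional sequence of flow components. The function names `flow_variances`, `integral_extrema`, and `combine` are hypothetical, and plain concatenation is only one possible way of combining the visual and acoustic feature sets.

```python
from itertools import accumulate
from statistics import pvariance


def flow_variances(flow):
    """Return (horizontal variance, vertical variance) for one frame.

    `flow` is a list of (u, v) optical-flow vectors, one per sampled pixel.
    Near-zero variances suggest a still mouth, e.g. a silence or pause.
    """
    us = [u for u, _ in flow]
    vs = [v for _, v in flow]
    return pvariance(us), pvariance(vs)


def integral_extrema(components):
    """Return (max, min) of the running sum of a sequence of flow components.

    Assumption: the "integral" is approximated by a cumulative sum; the
    abstract does not specify the integration path or direction.
    """
    running = list(accumulate(components))
    return max(running), min(running)


def combine(acoustic, visual):
    """Concatenate acoustic and visual features into one observation vector.

    Assumption: simple concatenation; the abstract only says the sets are
    combined within an HMM-based framework.
    """
    return list(acoustic) + list(visual)
```

For a frame in which every flow vector is zero, `flow_variances` returns zero for both components, which is the kind of cue the abstract associates with silence/pause periods.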