Multimodal speaker/speech recognition using lip motion, lip texture and audio

Authors:
H. E. Çetingül;E. Erzin;Y. Yemez;A. M. Tekalp
Affiliations:
College of Engineering, Koç University, Sarıyer, Istanbul, Turkey;College of Engineering, Koç University, Sarıyer, Istanbul, Turkey;College of Engineering, Koç University, Sarıyer, Istanbul, Turkey;College of Engineering, Koç University, Sarıyer, Istanbul, Turkey
Venue:
Signal Processing - Special section: Multimodal human-computer interfaces
Year:
2006

Abstract

We present a new multimodal speaker/speech recognition system that integrates the audio, lip texture and lip motion modalities. The fusion of audio and face texture modalities has been investigated before in the literature. This work instead investigates the benefit of including the lip motion modality in two distinct cases: speaker recognition and speech recognition. The audio modality is represented by the well-known mel-frequency cepstral coefficients (MFCC) together with their first and second derivatives, whereas the lip texture modality is represented by the 2D-DCT coefficients of the luminance component within a bounding box around the lip region. We employ a new representation of the lip motion modality for speaker/speech recognition, based on discriminative analysis of the dense motion vectors within the same bounding box. The audio, lip texture and lip motion modalities are fused by the so-called reliability weighted summation (RWS) decision rule. Experimental results show that including the lip motion modality provides further performance gains over those obtained by fusing audio and lip texture alone, in both the speaker identification and the isolated word recognition scenarios.
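
As a rough illustration of two steps named in the abstract, the sketch below computes 2D-DCT coefficients of a luminance block and fuses per-modality class scores by a reliability-weighted sum. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how coefficients are selected, how scores are normalized, or how reliability is measured, so the low-frequency selection, the min-max normalization, and the "gap between best and second-best score" reliability are all hypothetical choices, as are all function names.

```python
import math

def dct2(block):
    """Orthonormal 2D DCT-II of a rows x cols list of lists (e.g. lip-box luminance)."""
    rows, cols = len(block), len(block[0])

    def alpha(k, n):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            s = 0.0
            for x in range(rows):
                for y in range(cols):
                    s += (block[x][y]
                          * math.cos(math.pi * (2 * x + 1) * u / (2 * rows))
                          * math.cos(math.pi * (2 * y + 1) * v / (2 * cols)))
            out[u][v] = alpha(u, rows) * alpha(v, cols) * s
    return out

def low_freq_features(coeffs, k):
    """ASSUMPTION: keep the k coefficients with the smallest u + v (lowest frequencies)."""
    idx = sorted(((u, v) for u in range(len(coeffs)) for v in range(len(coeffs[0]))),
                 key=lambda p: (p[0] + p[1], p[0]))
    return [coeffs[u][v] for u, v in idx[:k]]

def normalize(scores):
    """ASSUMPTION: min-max normalize one modality's class scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {c: ((s - lo) / span if span > 0 else 0.0) for c, s in scores.items()}

def reliability(norm_scores):
    """ASSUMPTION: reliability = gap between best and second-best normalized score."""
    top = sorted(norm_scores.values(), reverse=True)
    return top[0] - top[1] if len(top) > 1 else 1.0

def rws_fuse(modality_scores):
    """Fuse {modality: {class: score}} by a reliability-weighted sum of normalized scores."""
    normed = {m: normalize(s) for m, s in modality_scores.items()}
    gamma = {m: reliability(s) for m, s in normed.items()}
    total = sum(gamma.values())
    # Weights sum to one; fall back to equal weights if no modality is decisive.
    weights = {m: (g / total if total > 0 else 1.0 / len(gamma)) for m, g in gamma.items()}
    classes = list(next(iter(normed.values())).keys())
    fused = {c: sum(weights[m] * normed[m][c] for m in normed) for c in classes}
    best = max(fused, key=fused.get)
    return best, fused, weights

# Illustrative log-likelihood scores for three enrolled classes (made-up numbers).
example_scores = {
    "audio":       {"A": -10.0, "B": -30.0, "C": -50.0},
    "lip_texture": {"A": -20.0, "B": -19.0, "C": -40.0},
    "lip_motion":  {"A": -5.0,  "B": -25.0, "C": -25.0},
}
best_class, fused_scores, modality_weights = rws_fuse(example_scores)
```

In this toy example the lip texture scores barely separate the two top classes, so that modality receives the smallest weight and the fused decision follows the more decisive audio and lip motion scores. A real system would obtain the scores from trained classifiers rather than from hand-picked values.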