We aim to model the appearance of the lower face region in order to assist visual feature extraction for audio-visual speech processing applications. In this paper, we present a neural-network-based statistical appearance model of the lips that classifies pixels into lips, skin, or inner-mouth classes. Training this model requires labeled examples, which we obtain by labeling images automatically with a lip-shape model and a red-hue energy function. To improve lip-tracking performance, we use blue marked-up image sequences of the same subjects uttering the same sentences as in the natural, non-marked-up sequences. The lip shapes, easily extracted from the blue sequences, are then mapped to the natural ones using acoustic information. The resulting lip-shape estimates simplify lip tracking on the natural images by reducing the dimensionality of the parameter space in the red-hue energy minimization, yielding better estimates of contour shape and location. Applied to a small audio-visual database of three subjects, the proposed method achieves pixel-classification errors of around 6%, compared with 3% for hand-placed contours and 20% for filtered red hue.
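To make the pixel-classification idea concrete, the following is a minimal sketch of the two ingredients the abstract mentions: a red-hue (pseudo-hue) feature for lip/skin separation, and a statistical classifier over per-pixel color features. The class colors, the feature set, and the single-layer softmax model are illustrative assumptions for this sketch, not the paper's actual network or training data.

```python
import numpy as np

# Pseudo-hue ("red-hue") feature commonly used for lip segmentation:
# lips are redder than surrounding skin, so r / (r + g) coarsely separates them.
def pseudo_hue(rgb):
    r, g = rgb[..., 0].astype(float), rgb[..., 1].astype(float)
    return r / np.maximum(r + g, 1e-6)

def features(rgb):
    # Per-pixel feature vector: pseudo-hue plus normalized R, G, B.
    flat = rgb.reshape(-1, 3)
    return np.column_stack([pseudo_hue(flat), flat.astype(float) / 255.0])

def train_softmax(X, y, n_classes=3, lr=0.5, epochs=2000):
    # Tiny single-layer softmax classifier, a stand-in for the paper's
    # neural-network appearance model.
    Xb = np.column_stack([X, np.ones(len(X))])  # append bias term
    W = np.zeros((Xb.shape[1], n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = Xb @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * Xb.T @ (p - Y) / len(X)  # gradient step on cross-entropy
    return W

def classify(W, X):
    Xb = np.column_stack([X, np.ones(len(X))])
    return (Xb @ W).argmax(axis=1)

# Synthetic training pixels; class colors are assumptions for illustration:
# 0 = lips, 1 = skin, 2 = inner mouth.
rng = np.random.default_rng(0)
means = np.array([[200, 80, 90], [210, 160, 130], [80, 30, 40]])
rgb = np.clip(np.repeat(means, 300, axis=0)
              + rng.normal(scale=12, size=(900, 3)), 0, 255).astype(np.uint8)
labels = np.repeat(np.arange(3), 300)

W = train_softmax(features(rgb), labels)
acc = (classify(W, features(rgb)) == labels).mean()
print(f"training accuracy: {acc:.2f}")
```

In the paper's setting, the automatically generated labels (from the lip-shape model and the red-hue energy function) would replace the synthetic pixels above, and the trained classifier would then score pixels in new frames as lips, skin, or inner mouth.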