Extraction of Visual Features for Lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Person identification using automatic integration of speech, lip, and face experts. WBMA '03: Proceedings of the 2003 ACM SIGMM Workshop on Biometrics Methods and Applications.
Proceedings of the 6th International Conference on Multimodal Interfaces.
Visual Speech Recognition with Loosely Synchronized Feature Streams. ICCV '05: Proceedings of the Tenth IEEE International Conference on Computer Vision, Volume 2.
Local Binary Patterns as an Image Preprocessing for Face Authentication. FGR '06: Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition.
Product HMMs for audio-visual continuous speech recognition using facial animation parameters. ICME '03: Proceedings of the 2003 International Conference on Multimedia and Expo, Volume 1.
Face Description with Local Binary Patterns: Application to Face Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Audio-visual speech recognition using MPEG-4 compliant visual features. EURASIP Journal on Applied Signal Processing.
Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Applied Signal Processing.
IEEE Transactions on Image Processing.
Dynamic Texture Based Gait Recognition. ICB '09: Proceedings of the Third International Conference on Advances in Biometrics.
Lipreading with local spatiotemporal descriptors
IEEE Transactions on Multimedia
Visual speech information plays an important role in speech recognition under noisy conditions or for listeners with hearing impairment. In this paper, we propose local spatiotemporal descriptors to represent and recognize spoken isolated phrases based solely on visual input. Positions of the eyes, determined by a robust face and eye detector, are used to localize the mouth regions in face images. Spatiotemporal local binary patterns extracted from these regions are used to describe phrase sequences. In our experiments with 817 sequences from ten phrases and 20 speakers, promising accuracies of 62% and 70% were obtained in speaker-independent and speaker-dependent recognition, respectively. In comparison with other methods on the Tulips1 audio-visual database, our method's accuracy of 92.7% clearly outperforms the others. Advantages of our approach include local processing and robustness to monotonic gray-scale changes. Moreover, no error-prone segmentation of moving lips is needed.
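The local binary pattern descriptor underlying the abstract is simple to sketch. The paper itself uses a spatiotemporal extension (LBP computed over the mouth-region video volume); the following is a simplified spatial-only sketch in Python with NumPy, with function names (`lbp_codes`, `lbp_histogram`) chosen here for illustration. It also demonstrates the monotonic gray-scale invariance claimed in the abstract: any order-preserving intensity transform leaves the codes unchanged, because each bit depends only on the sign of a neighbor-minus-center comparison.

```python
import numpy as np

def lbp_codes(img):
    """Basic 8-neighbour LBP (radius 1) over the interior of a 2-D image.
    Each pixel gets an 8-bit code: one bit per neighbour, set when that
    neighbour's intensity is >= the centre pixel's intensity."""
    img = np.asarray(img, dtype=np.int64)
    center = img[1:-1, 1:-1]
    # Neighbour offsets, clockwise from the top-left neighbour.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy: img.shape[0] - 1 + dy,
                    1 + dx: img.shape[1] - 1 + dx]
        codes |= (neigh >= center).astype(np.int64) << bit
    return codes

def lbp_histogram(img, bins=256):
    """Normalised histogram of LBP codes -- the per-region feature vector
    that would be concatenated across mouth sub-regions and compared."""
    codes = lbp_codes(img)
    hist = np.bincount(codes.ravel(), minlength=bins).astype(float)
    return hist / hist.sum()
```

For example, `lbp_codes(2 * img + 5)` returns exactly the same code map as `lbp_codes(img)`, since doubling and shifting intensities preserves all pairwise orderings; this is the robustness to monotonic gray-scale changes cited as an advantage above.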