Visual features extracting & selecting for lipreading

  • Authors:
  • Hong-Xun Yao; Wen Gao; Wei Shan; Ming-Hui Xu

  • Affiliations:
  • Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, China; Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, China and Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, China; Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, China

  • Venue:
  • AVBPA'03: Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication
  • Year:
  • 2003


Abstract

This paper puts forward an effective way to select and extract visual features for lipreading. The features come from both low-level and high-level representations, which complement each other, and 41-dimensional feature vectors are used for recognition. Tested on AVCC, a bimodal database of sentences covering all Chinese pronunciations, lipreading assistance raises automatic speech recognition accuracy from 84.1% to 87.8%. Under noisy conditions, it improves accuracy by 19.5 percentage points (from 31.7% to 51.2%) for speaker-dependent recognition and by 27.7 percentage points (from 27.6% to 55.3%) for speaker-independent recognition. The paper also shows that visual speech information can effectively compensate for the loss of acoustic information, improving the recognition rate by 10% to 30% in our system depending on the amount of noise in the speech signal; this improvement is larger than that of IBM's ASR system, and the system performs better in noisy environments.
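
The abstract does not say which low-level and high-level cues make up the 41-dimensional vector. The sketch below only illustrates the general idea of combining the two levels, assuming DCT appearance coefficients of a mouth region as the low-level part and simple geometric lip-shape measures as the high-level part, with a hypothetical 32 + 9 split; all function names and the feature split are assumptions, not the authors' method.

```python
import numpy as np
from scipy.fft import dct


def extract_low_level_features(mouth_roi, n_coeffs=32):
    """Hypothetical low-level appearance features: 2-D DCT coefficients
    of a grayscale mouth region-of-interest (simplified coefficient
    selection; the paper's actual descriptor is not specified here)."""
    coeffs = dct(dct(mouth_roi, axis=0, norm="ortho"), axis=1, norm="ortho")
    return coeffs.flatten()[:n_coeffs]


def extract_high_level_features(lip_contour):
    """Hypothetical high-level geometric features: mouth width, height,
    area (shoelace formula), centroid, and a few ratios -- 9 values in all,
    illustrative only."""
    xs, ys = lip_contour[:, 0], lip_contour[:, 1]
    width = xs.max() - xs.min()
    height = ys.max() - ys.min()
    area = 0.5 * abs(np.dot(xs, np.roll(ys, 1)) - np.dot(ys, np.roll(xs, 1)))
    return np.array([
        width, height, area,
        height / max(width, 1e-6),
        area / max(width * height, 1e-6),
        xs.mean(), ys.mean(),
        width * height,
        width / max(height, 1e-6),
    ])


def visual_feature_vector(mouth_roi, lip_contour):
    """Concatenate the two complementary levels into one 41-dim vector,
    mirroring the paper's idea of fusing low- and high-level visual cues."""
    low = extract_low_level_features(mouth_roi, n_coeffs=32)   # 32 dims (assumed split)
    high = extract_high_level_features(lip_contour)            # 9 dims (assumed split)
    return np.concatenate([low, high])                         # 41 dims total


# Example with synthetic inputs: a 16x16 mouth ROI and a rectangular "contour"
roi = np.random.rand(16, 16)
contour = np.array([[0, 0], [10, 0], [10, 6], [0, 6]], dtype=float)
print(visual_feature_vector(roi, contour).shape)  # (41,)
```

The resulting 41-dimensional vectors would then feed the recognizer, with the visual stream supplementing the acoustic features, which is the role the abstract attributes to lipreading assistance.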