Bimodal automatic speech segmentation based on audio and visual information fusion

  • Authors:
  • Eren Akdemir, Tolga Ciloglu

  • Affiliations:
  • Electrical and Electronics Engineering Department, Middle East Technical University, 06531 Ankara, Turkey (both authors)

  • Venue:
  • Speech Communication
  • Year:
  • 2011

Abstract

Bimodal automatic speech segmentation, which uses visual information together with audio data, is introduced. The accuracy of automatic segmentation directly affects the quality of speech processing systems built on the segmented database. Combining audio and visual data lowers the average absolute boundary error between manual and automatic segmentation results. The information from the two modalities is fused at the feature level and used in an HMM-based speech segmentation system. A Turkish audiovisual speech database was prepared and used in the experiments. The average absolute boundary error decreases by up to 18% with different audiovisual feature vectors. The benefits of incorporating visual information are discussed for different phoneme boundary types, since each audiovisual feature vector performs differently at different types of phoneme boundaries. Selecting audiovisual feature vectors according to boundary class reduces the average absolute boundary error by approximately 25%. The visual data were collected with an ordinary webcam, which makes the proposed method very convenient for practical use.
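The feature-level fusion the abstract describes can be sketched in a minimal way: per-frame audio features (e.g. MFCCs at roughly 100 frames/s) are concatenated with per-frame visual features (e.g. lip-region parameters from a webcam at roughly 30 frames/s), after the slower visual stream is upsampled to the audio frame rate. The function name, dimensions, and interpolation scheme below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_features(audio_feats, visual_feats):
    """Feature-level fusion sketch: upsample the visual stream to the
    audio frame rate by linear interpolation, then concatenate per frame.

    audio_feats:  (T_a, D_a) array, e.g. MFCCs at ~100 frames/s
    visual_feats: (T_v, D_v) array, e.g. lip features at ~30 frames/s
    Returns a (T_a, D_a + D_v) fused feature matrix.
    """
    T_a = audio_feats.shape[0]
    T_v = visual_feats.shape[0]
    # Map each audio frame index to a fractional visual frame index.
    src = np.linspace(0.0, T_v - 1.0, num=T_a)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T_v - 1)
    w = (src - lo)[:, None]
    # Linear interpolation between neighbouring visual frames.
    visual_up = (1.0 - w) * visual_feats[lo] + w * visual_feats[hi]
    return np.concatenate([audio_feats, visual_up], axis=1)

# Toy example: 10 audio frames of 13-dim features, 3 video frames of 4-dim features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 13))
video = rng.standard_normal((3, 4))
fused = fuse_features(audio, video)
print(fused.shape)  # (10, 17)
```

The fused vectors would then serve as observations for an HMM-based segmenter in place of audio-only features; the HMM training and forced-alignment steps themselves are beyond this sketch.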