Using maximum entropy (ME) model to incorporate gesture cues for SU detection

  • Authors:
  • Lei Chen; Mary Harper; Zhongqiang Huang

  • Affiliations:
  • Purdue University, West Lafayette, IN; Purdue University, West Lafayette, IN; Purdue University, West Lafayette, IN

  • Venue:
  • Proceedings of the 8th international conference on Multimodal interfaces
  • Year:
  • 2006

Abstract

Accurate identification of sentence units (SUs) in spontaneous speech has been found to improve the accuracy of speech recognition, as well as downstream applications such as parsing. In recent multimodal investigations, gestural features were utilized, in addition to lexical and prosodic cues from the speech channel, for detecting SUs in conversational interactions using a hidden Markov model (HMM) approach. Although this approach is computationally efficient and provides a convenient way to modularize the knowledge sources, it has two drawbacks for our SU task. First, standard HMM training methods maximize the joint probability of observations and hidden events, rather than the posterior probability of a hidden event given the observations, a criterion more closely related to SU classification error. Second, integrating gestural features is challenging because their absence supports neither SU events nor non-events; only the co-timing of gestures with the speech channel should impact the model. To address these problems, a Maximum Entropy (ME) model is used to combine multimodal cues for SU estimation. Experiments carried out on VACE multi-party meetings confirm that the ME modeling approach provides a solid framework for multimodal integration.
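
The sketch below illustrates, in broad strokes, the kind of ME classifier the abstract describes: a conditional (posterior-probability) model that combines lexical, prosodic, and gestural features at each candidate boundary, with gestural features firing only when a gesture is co-timed with the boundary. It is not the authors' code; the feature names, values, and toy data are hypothetical, and a standard logistic-regression implementation stands in for the ME trainer.

```python
# Minimal sketch (not the paper's implementation): a MaxEnt-style classifier
# for SU boundary detection combining lexical, prosodic, and gestural cues.
# Feature names and training examples are illustrative placeholders.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def boundary_features(word, pause_ms, gesture_phase):
    """Build a sparse feature dict for one candidate inter-word boundary."""
    feats = {
        f"word={word}": 1.0,            # lexical cue
        "pause_sec": pause_ms / 1000.0,  # prosodic cue (scaled pause duration)
    }
    # Gestural cues contribute only when a gesture is co-timed with the
    # boundary; their absence adds nothing, unlike an HMM observation model.
    if gesture_phase is not None:
        feats[f"gesture={gesture_phase}"] = 1.0
    return feats

# Toy training data: (features, label), label 1 = SU boundary, 0 = non-event.
X = [
    boundary_features("okay", 450, "retraction"),
    boundary_features("the", 30, None),
    boundary_features("right", 600, "hold"),
    boundary_features("and", 80, None),
]
y = [1, 0, 1, 0]

# Logistic regression maximizes the conditional (posterior) likelihood of the
# labels given the observations -- the ME training criterion the abstract
# contrasts with joint-likelihood HMM training.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Posterior probability of an SU boundary for a new multimodal observation.
print(model.predict_proba([boundary_features("so", 500, "retraction")]))
```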