Utilizing gestures to improve sentence boundary detection
Multimedia Tools and Applications
Accurate identification of sentence units (SUs) in spontaneous speech has been found to improve the accuracy of speech recognition, as well as downstream applications such as parsing. In recent multimodal investigations, gestural features were utilized, in addition to lexical and prosodic cues from the speech channel, for detecting SUs in conversational interactions using a hidden Markov model (HMM) approach. Although this approach is computationally efficient and provides a convenient way to modularize the knowledge sources, it has two drawbacks for our SU task. First, standard HMM training methods maximize the joint probability of observations and hidden events, rather than the posterior probability of a hidden event given the observations, a criterion more closely related to SU classification error. Second, gestural features are challenging to integrate because the absence of a gesture is evidence neither for nor against an SU event; it is only the co-timing of gestures with the speech channel that should impact the model. To address these problems, a Maximum Entropy (ME) model is used to combine multimodal cues for SU estimation. Experiments carried out on VACE multi-party meetings confirm that the ME modeling approach provides a solid framework for multimodal integration.
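To make the modeling contrast concrete, the minimal sketch below shows how a maximum entropy classifier (equivalently, logistic regression) can combine lexical, prosodic, and gestural cues for SU detection at word boundaries. This is not the authors' implementation: the feature names, values, and the tiny training set are invented for illustration. It does, however, exhibit the two properties the abstract emphasizes: training optimizes the posterior probability of the SU event given the observations, and a boundary with no co-occurring gesture simply fires no gestural features, so gesture absence contributes nothing to the decision.

```python
# Minimal sketch (assumed feature design, not from the paper): a maximum
# entropy model for sentence-unit (SU) detection at inter-word boundaries.
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One feature dict per inter-word boundary. Gestural features are omitted
# when no gesture co-occurs with the boundary, so their absence is evidence
# neither for nor against an SU event -- unlike an HMM observation stream,
# which must be accounted for at every time step.
boundaries = [
    {"word=so": 1, "pause_dur": 0.42, "pitch_reset": 1, "gesture_hold": 1},
    {"word=and": 1, "pause_dur": 0.05},                  # no gesture observed
    {"word=okay": 1, "pause_dur": 0.61, "gesture_retract": 1},
    {"word=the": 1, "pause_dur": 0.02},
]
labels = [1, 0, 1, 0]  # 1 = SU boundary, 0 = no boundary

vec = DictVectorizer()
X = vec.fit_transform(boundaries)

# Logistic regression is the binary maximum entropy model: fitting maximizes
# the conditional log-likelihood P(event | observations), the posterior
# criterion the abstract contrasts with joint HMM training.
me_model = LogisticRegression(max_iter=1000).fit(X, labels)

posterior = me_model.predict_proba(X)[:, 1]  # P(SU | cues) per boundary
print(np.round(posterior, 3))
```

In this formulation, adding a new knowledge source (e.g., another gestural cue) only requires defining additional features; the conditional training criterion weighs them directly against SU classification evidence rather than requiring a generative model of each observation stream.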