Gesture salience as a hidden variable for coreference resolution and keyframe extraction

  • Authors:
  • Jacob Eisenstein, Regina Barzilay, Randall Davis

  • Affiliations:
  • Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA (all authors)

  • Venue:
  • Journal of Artificial Intelligence Research
  • Year:
  • 2008

Abstract

Gesture is a non-verbal modality that can contribute crucial information to the understanding of natural language. But not all gestures are informative, and noncommunicative hand motions may confuse natural language processing (NLP) and impede learning. People have little difficulty ignoring irrelevant hand movements and focusing on meaningful gestures, suggesting that an automatic system could also be trained to perform this task. However, the informativeness of a gesture is context-dependent and labeling enough data to cover all cases would be expensive. We present conditional modality fusion, a conditional hidden-variable model that learns to predict which gestures are salient for coreference resolution, the task of determining whether two noun phrases refer to the same semantic entity. Moreover, our approach uses only coreference annotations, and not annotations of gesture salience itself. We show that gesture features improve performance on coreference resolution, and that by attending only to gestures that are salient, our method achieves further significant gains. In addition, we show that the model of gesture salience learned in the context of coreference accords with human intuition, by demonstrating that gestures judged to be salient by our model can be used successfully to create multimedia keyframe summaries of video. These summaries are similar to those created by human raters, and significantly outperform summaries produced by baselines from the literature.
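The abstract describes conditional modality fusion as a conditional hidden-variable model that marginalizes over gesture salience when scoring coreference decisions. The sketch below is a minimal illustration of that general idea in a log-linear form; the function names, feature layout, and parameterization are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of a conditional hidden-variable model in the spirit of
# conditional modality fusion. For a pair of noun phrases, x_verbal holds
# verbal features and x_gesture holds gesture features; h in {0, 1} is the
# hidden gesture-salience variable. The coreference label y in {0, 1} is
# scored by marginalizing over h:
#     P(y | x) = sum_h exp(score(y, h, x)) / sum_{y', h'} exp(score(y', h', x))

def score(y, h, x_verbal, x_gesture, w_verbal, w_gesture, w_salience):
    """Log-linear score for one (label, hidden salience) configuration."""
    s = y * (w_verbal @ x_verbal)            # verbal features always contribute
    if h == 1:                               # gesture features count only when salient
        s += y * (w_gesture @ x_gesture)
    s += h * (w_salience @ x_gesture)        # salience itself is predicted from gesture cues
    return s

def coref_probability(x_verbal, x_gesture, w_verbal, w_gesture, w_salience):
    """P(y = 1 | x), marginalizing over the hidden salience variable h."""
    scores = np.array([[score(y, h, x_verbal, x_gesture,
                              w_verbal, w_gesture, w_salience)
                        for h in (0, 1)] for y in (0, 1)])
    probs = np.exp(scores - scores.max())    # softmax over all (y, h) configurations
    probs /= probs.sum()
    return probs[1].sum()                    # sum over h for y = 1
```

Because the salience variable is hidden, training such a model needs only coreference labels, which mirrors the paper's claim that no annotations of gesture salience are required; the salience posterior P(h = 1 | x) learned as a by-product is what would then drive keyframe selection.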