Utilizing visual attention for cross-modal coreference interpretation

  • Authors:
  • Donna Byron; Thomas Mampilly; Vinay Sharma; Tianfang Xu

  • Affiliation:
  • Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio

  • Venue:
  • CONTEXT'05 Proceedings of the 5th international conference on Modeling and Using Context
  • Year:
  • 2005

Abstract

In this paper, we describe an exploratory study to develop a model of visual attention that could aid automatic interpretation of exophors in situated dialog. The model is intended to support the reference resolution needs of embodied conversational agents, such as graphical avatars and robotic collaborators. The model tracks the attentional state of one dialog participant as represented by that participant's visual input stream, taking into account the recency, exposure time, and visual distinctness of each viewed item. The model predicts the correct referent of 52% of the referring expressions produced by speakers in human-human dialog while they collaborated on a task in a virtual world. This accuracy is comparable to that of reference resolution based on linguistic salience computed over the same data.
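The abstract names three visual cues the attention model combines: recency, exposure time, and visual distinctness of each viewed item. The paper's actual scoring function is not reproduced here; the sketch below is only a minimal illustration of that general idea, with hypothetical weights, decay, and helper names (`ViewedItem`, `visual_salience`, `resolve_exophor`) chosen for clarity rather than taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ViewedItem:
    """An object that has appeared in the viewer's visual input stream."""
    name: str
    last_seen: float      # timestamp of most recent appearance (seconds)
    exposure: float       # cumulative time in view (seconds)
    distinctness: float   # visual distinctness in [0, 1]

def visual_salience(item: ViewedItem, now: float,
                    w_recency: float = 0.5,
                    w_exposure: float = 0.3,
                    w_distinct: float = 0.2) -> float:
    """Combine recency, exposure time, and visual distinctness into one score.
    Weights, decay, and saturation are illustrative assumptions, not the paper's values."""
    recency = 1.0 / (1.0 + (now - item.last_seen))   # decays once the item leaves view
    exposure = min(item.exposure / 10.0, 1.0)        # saturates after 10 s of viewing
    return w_recency * recency + w_exposure * exposure + w_distinct * item.distinctness

def resolve_exophor(candidates: list[ViewedItem], now: float) -> ViewedItem:
    """Pick the most visually salient candidate as the referent of an exophor."""
    return max(candidates, key=lambda item: visual_salience(item, now))

# Example: the speaker says "that one" at t = 12 s.
items = [
    ViewedItem("red_button", last_seen=11.5, exposure=4.0, distinctness=0.9),
    ViewedItem("door",       last_seen=6.0,  exposure=9.0, distinctness=0.4),
]
print(resolve_exophor(items, now=12.0).name)   # -> red_button
```

In this toy setup, the recently seen and visually distinct `red_button` outscores the longer-exposed but stale `door`, mirroring the intuition that an exophor such as "that one" tends to pick out whatever currently dominates the listener's visual attention.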