Grounding spatial prepositions for video search

Authors:
Stefanie Tellex; Deb Roy
Affiliations:
MIT Media Lab, Cambridge, MA, USA; MIT Media Lab, Cambridge, MA, USA
Venue:
Proceedings of the 2009 international conference on Multimodal interfaces
Year:
2009

Abstract

Spatial language video retrieval is an important real-world problem that forms a test bed for evaluating semantic structures for natural language descriptions of motion on naturalistic data. Video search by natural language query requires that linguistic input be converted into structures that operate on video in order to find clips that match a query. This paper describes a framework for grounding the meaning of spatial prepositions in video. We present a library of features that can be used to automatically classify a video clip based on whether it matches a natural language query. To evaluate these features, we collected a corpus of natural language descriptions of the motion of people in video clips. We characterize the language used in the corpus, and use it to train and test models for the meanings of the spatial prepositions "to," "across," "through," "out," "along," "towards," and "around." The classifiers can be used to build a spatial language video retrieval system that finds clips matching queries such as "across the kitchen."
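
To make the idea of grounding a preposition in geometric features concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's feature library or its trained models: the representation (a person's track as a list of 2D points, a landmark such as "the kitchen" as an axis-aligned box), the function names, the three features, and the hand-written rule for "across" are all hypothetical. The abstract describes classifiers trained on a corpus; a fixed rule stands in for one here only to keep the example short.

```python
from math import hypot

# Hypothetical representation: a track is a list of (x, y) points and a
# landmark region is an axis-aligned box (x_min, y_min, x_max, y_max).

def centroid(box):
    """Return the center point of an axis-aligned box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def inside(point, box):
    """Return True if the point lies within the box (edges included)."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def path_features(path, box):
    """Compute a few illustrative features of a track relative to a landmark."""
    cx, cy = centroid(box)
    sx, sy = path[0]
    ex, ey = path[-1]
    return {
        # Positive when the track ends closer to the landmark center than it starts.
        "approach": hypot(sx - cx, sy - cy) - hypot(ex - cx, ey - cy),
        # Fraction of track points that fall inside the landmark.
        "frac_inside": sum(inside(p, box) for p in path) / len(path),
        # Negative when start and end lie on opposite sides of the center.
        "side_dot": (sx - cx) * (ex - cx) + (sy - cy) * (ey - cy),
    }

def matches_across(path, box):
    """Hand-written stand-in for a trained 'across' classifier (an assumption)."""
    f = path_features(path, box)
    return f["frac_inside"] > 0.0 and f["side_dot"] < 0.0

kitchen = (1, 0, 9, 10)
crossing = [(0, 5), (2, 5), (5, 5), (8, 5), (10, 5)]
partial = [(0, 5), (2, 5), (3, 5)]
print(matches_across(crossing, kitchen))  # track passes from one side to the other
print(matches_across(partial, kitchen))   # track enters but stays on the same side
```

In a retrieval setting, feature values like these would be fed to a learned classifier per preposition, and clips would be ranked by the classifier's score for the query rather than filtered by a fixed threshold.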