Grounding spatial language for video search

  • Authors:
  • Stefanie Tellex (MIT Media Lab, Cambridge, MA)
  • Thomas Kollar (The Stata Center, MIT CSAIL, Cambridge, MA)
  • George Shaw (MIT Media Lab, Cambridge, MA)
  • Nicholas Roy (The Stata Center, MIT CSAIL, Cambridge, MA)
  • Deb Roy (MIT Media Lab, Cambridge, MA)

  • Venue:
  • International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction
  • Year:
  • 2010


Abstract

The ability to find a video clip that matches a natural language description of an event would enable intuitive search of large databases of surveillance video. We present a mechanism for connecting a spatial language query to a video clip corresponding to the query. The system can retrieve video clips matching millions of potential queries that describe complex events in video such as "people walking from the hallway door, around the island, to the kitchen sink." By breaking down the query into a sequence of independent structured clauses and modeling the meaning of each component of the structure separately, we are able to improve on previous approaches to video retrieval by finding clips that match much longer and more complex queries using a rich set of spatial relations such as "down" and "past." We present a rigorous analysis of the system's performance, based on a large corpus of task-constrained language collected from fourteen subjects. Using this corpus, we show that the system effectively retrieves clips that match natural language descriptions: 58.3% were ranked in the top two of ten in a retrieval task. Furthermore, we show that spatial relations play an important role in the system's performance.
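The factored approach the abstract describes, breaking a query into independent structured clauses and scoring each spatial relation separately against a clip's motion track, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the clause structure, the landmark coordinates, and the per-relation scoring functions are all assumptions made up for this example.

```python
# Hypothetical sketch: rank video clips against a spatial language query that
# has been factored into independent (relation, landmark) clauses, in the
# spirit of "people walking from the hallway door, around the island, to the
# kitchen sink." All scoring functions here are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]      # (x, y) position of a tracked person
Trajectory = List[Point]         # sequence of positions from one clip


@dataclass
class Clause:
    relation: str                # e.g. "from", "past", "to"
    landmark: Point              # assumed-known landmark location


def _dist(p: Point, q: Point) -> float:
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5


def score_from(traj: Trajectory, landmark: Point) -> float:
    # Higher when the trajectory starts near the landmark.
    return 1.0 / (1.0 + _dist(traj[0], landmark))


def score_to(traj: Trajectory, landmark: Point) -> float:
    # Higher when the trajectory ends near the landmark.
    return 1.0 / (1.0 + _dist(traj[-1], landmark))


def score_past(traj: Trajectory, landmark: Point) -> float:
    # Higher when some point of the trajectory passes close to the landmark.
    return 1.0 / (1.0 + min(_dist(p, landmark) for p in traj))


RELATION_MODELS: Dict[str, Callable[[Trajectory, Point], float]] = {
    "from": score_from,
    "to": score_to,
    "past": score_past,
}


def score_clip(traj: Trajectory, clauses: List[Clause]) -> float:
    # Each clause is scored independently of the others; the clip score is
    # the product, so one badly matching clause drags the whole score down.
    score = 1.0
    for c in clauses:
        score *= RELATION_MODELS[c.relation](traj, c.landmark)
    return score


# "from the door, past the island, to the sink" (made-up landmark positions)
query = [
    Clause("from", (0.0, 0.0)),
    Clause("past", (5.0, 1.0)),
    Clause("to", (10.0, 0.0)),
]
matching = [(0.0, 0.0), (5.0, 1.0), (10.0, 0.0)]   # follows the described path
detour = [(9.0, 9.0), (0.0, 5.0), (2.0, 2.0)]      # unrelated wandering

ranked = sorted([matching, detour],
                key=lambda t: score_clip(t, query), reverse=True)
```

Because the clauses are modeled independently, adding a new relation only requires a new entry in `RELATION_MODELS`, which is what lets a factored system cover millions of compositional queries from a small inventory of relations.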