The ability to find a video clip that matches a natural language description of an event would enable intuitive search of large databases of surveillance video. We present a mechanism for connecting a spatial language query to a video clip corresponding to the query. The system can retrieve video clips matching millions of potential queries that describe complex events in video such as "people walking from the hallway door, around the island, to the kitchen sink." By breaking down the query into a sequence of independent structured clauses and modeling the meaning of each component of the structure separately, we are able to improve on previous approaches to video retrieval by finding clips that match much longer and more complex queries using a rich set of spatial relations such as "down" and "past." We present a rigorous analysis of the system's performance, based on a large corpus of task-constrained language collected from fourteen subjects. Using this corpus, we show that the system effectively retrieves clips that match natural language descriptions: 58.3% were ranked in the top two of ten in a retrieval task. Furthermore, we show that spatial relations play an important role in the system's performance.
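The decomposition the abstract describes can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual parser: here a clause boundary is just a spatial preposition followed by a landmark phrase, and the relation vocabulary (`RELATIONS`) is an assumed example set.

```python
import re

# Assumed illustrative vocabulary of spatial relations; the paper's
# actual relation set and parser are richer than this sketch.
RELATIONS = ["from", "around", "to", "past", "down", "through", "across"]

def decompose(query):
    """Split a spatial language query into (relation, landmark) clauses.

    Each clause can then be scored against a video track independently,
    mirroring the abstract's idea of modeling each structural component
    of the query separately.
    """
    pattern = r"\b(" + "|".join(RELATIONS) + r")\b\s+([^,]+)"
    return [(rel, lm.strip().rstrip("."))
            for rel, lm in re.findall(pattern, query)]

query = ("people walking from the hallway door, "
         "around the island, to the kitchen sink")
print(decompose(query))
# [('from', 'the hallway door'), ('around', 'the island'),
#  ('to', 'the kitchen sink')]
```

Because each clause is modeled independently, a retrieval system built this way can combine per-clause scores (e.g., by product or sum) to rank clips against queries far longer than any single relation model could handle on its own.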