Language Label Learning for Visual Concepts Discovered from Video Sequences

Authors:
Prithwijit Guha;Amitabha Mukerjee
Affiliations:
Department of Electrical Engineering, Indian Institute of Technology, Kanpur, Kanpur - 208016, Uttar Pradesh;Department of Computer Science & Engineering, Indian Institute of Technology, Kanpur, Kanpur - 208016, Uttar Pradesh
Venue:
Attention in Cognitive Systems. Theories and Systems from an Interdisciplinary Viewpoint
Year:
2008

Abstract

Computational models of grounded language learning have been based on the premise that words and concepts are learned simultaneously. Given the mounting cognitive evidence for concept formation in infants, we argue that the availability of pre-lexical concepts (learned from image sequences) leads to considerable computational efficiency in word acquisition. Key to the process is a model of bottom-up visual attention in dynamic scenes. Background learning and foreground segmentation are used to generate robust tracking and to detect occlusion events. Trajectories are clustered to obtain motion event concepts. The object concepts (image schemas) are abstracted from the combined appearance and motion data. The set of acquired concepts under visual attentive focus is then correlated with contemporaneous commentary to learn the grounded semantics of words and multi-word phrasal concatenations from the narrative. We demonstrate that even from a mere half hour of video (of a scene involving many objects and activities), a number of rudimentary concepts can be discovered. When these concepts are associated with unedited English commentary, several words emerge: approximately half the identified concepts from the video are associated with the correct concepts. Thus, the computational model reflects the beginning of language comprehension, based on attentional parsing of the visual data. Finally, the emergence of multi-word phrasal concatenations, a precursor to syntax, is observed where they are more salient referents than single words.
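
The correlation of attended concepts with contemporaneous commentary can be illustrated with a minimal sketch. The abstract does not state which association measure is used, so everything here is an assumption made for illustration: the function name `associate_words`, the concept labels `CAR` and `PERSON`, the toy commentary, and the score P(word | concept) * P(concept | word) computed from co-occurrence counts.

```python
from collections import Counter


def associate_words(episodes):
    """Pick, for each concept, the word that co-occurs with it most reliably.

    episodes: list of (concepts_in_focus, commentary_words) pairs.
    The score P(word | concept) * P(concept | word) is an illustrative
    assumption, not the measure used in the paper.
    """
    word_counts = Counter()
    concept_counts = Counter()
    joint_counts = Counter()

    for concepts, words in episodes:
        # Count each concept and word at most once per episode.
        concepts, words = set(concepts), set(words)
        concept_counts.update(concepts)
        word_counts.update(words)
        joint_counts.update((c, w) for c in concepts for w in words)

    best = {}
    for (concept, word), n in joint_counts.items():
        score = (n / concept_counts[concept]) * (n / word_counts[word])
        if concept not in best or score > best[concept][1]:
            best[concept] = (word, score)

    return {concept: word for concept, (word, _) in best.items()}


# Hypothetical toy data: concepts under attentive focus paired with commentary.
episodes = [
    ({"CAR"}, ["the", "car", "moves", "left"]),
    ({"CAR", "PERSON"}, ["the", "car", "passes", "a", "man"]),
    ({"PERSON"}, ["the", "man", "walks"]),
    ({"PERSON"}, ["a", "man", "crosses", "the", "road"]),
]
labels = associate_words(episodes)
print(labels)
```

Multiplying the two conditional probabilities penalizes frequent function words such as "the", which co-occur with every concept and therefore predict none of them in particular. Restricting the candidate concepts to those under attentive focus keeps the number of concept-word pairs small, which is the kind of efficiency gain the abstract argues for.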