Computational models of grounded language learning have generally been based on the premise that words and concepts are learned simultaneously. Given the mounting cognitive evidence for pre-lexical concept formation in infants, we argue that the availability of pre-lexical concepts (learned from image sequences) leads to considerable computational efficiency in word acquisition. Key to the process is a model of bottom-up visual attention in dynamic scenes. We use existing work in background-foreground segmentation, multiple-object tracking, object discovery, and trajectory clustering to form object-category and action concepts. The acquired concepts under attentive focus are then correlated with contemporaneous commentary to learn the grounded semantics of words and multi-word phrasal concatenations from the narrative. We demonstrate that even from a mere 5 minutes of video, a number of rudimentary visual concepts can be discovered. When these concepts are associated with unedited English commentary, several words emerge: more than 60% of the concepts discovered from the video are associated with correct language labels. The computational model thus imitates the beginning of language comprehension, based on attentional parsing of the visual data. Finally, the emergence of multi-word phrasal concatenations, a precursor to syntax, is observed where there are more salient referents than single words to name them.
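As an illustration of the kind of pipeline the abstract describes, the sketch below (Python, assuming OpenCV and NumPy) pairs a standard adaptive Gaussian-mixture background subtractor with a naive nearest-neighbour centroid tracker to produce object tracks, then scores word-concept pairs by normalised co-occurrence with time-stamped commentary. All function names, thresholds, and data formats here are hypothetical stand-ins; the paper's actual segmentation, tracking, clustering, and association methods are only summarised in the abstract.

```python
# Minimal sketch of two stages of the pipeline sketched in the abstract.
# Assumptions: OpenCV's MOG2 subtractor stands in for the paper's
# background-foreground segmentation; a greedy centroid tracker and a toy
# co-occurrence score stand in for its tracking and word-concept association.
from collections import Counter, defaultdict

import cv2
import numpy as np


def foreground_tracks(video_path, min_area=400, max_jump=60.0):
    """Segment moving foreground blobs and link their centroids into
    tracks with a greedy nearest-neighbour rule."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
    tracks = defaultdict(list)   # track id -> [(frame, x, y), ...]
    last_pos = {}                # track id -> last seen centroid
    next_id = 0
    frame_no = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
        # [-2] keeps this call compatible with both OpenCV 3.x and 4.x.
        contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                    cv2.CHAIN_APPROX_SIMPLE)[-2]
        for c in contours:
            if cv2.contourArea(c) < min_area:
                continue
            x, y, w, h = cv2.boundingRect(c)
            cx, cy = x + w / 2.0, y + h / 2.0
            # Attach the blob to the nearest live track, or start a new one.
            best = min(last_pos.items(),
                       key=lambda kv: np.hypot(kv[1][0] - cx, kv[1][1] - cy),
                       default=None)
            if best and np.hypot(best[1][0] - cx, best[1][1] - cy) < max_jump:
                tid = best[0]
            else:
                tid, next_id = next_id, next_id + 1
            last_pos[tid] = (cx, cy)
            tracks[tid].append((frame_no, cx, cy))
        frame_no += 1
    cap.release()
    return tracks


def associate(concept_intervals, commentary):
    """Score (word, concept) pairs by co-occurrence, normalised by the
    marginal frequencies of each word and each concept.

    concept_intervals: [(start_frame, end_frame, concept_label), ...]
    commentary:        [(frame, word), ...] time-stamped narration tokens
    """
    pair, word_n, concept_n = Counter(), Counter(), Counter()
    for _, word in commentary:
        word_n[word] += 1
    for start, end, concept in concept_intervals:
        concept_n[concept] += 1
        for frame, word in commentary:
            if start <= frame <= end:
                pair[(word, concept)] += 1
    return {wc: n / (word_n[wc[0]] * concept_n[wc[1]])
            for wc, n in pair.items()}
```

In a fuller system, the tracks returned by foreground_tracks would first be grouped into object-category and action concepts (the object-discovery and trajectory-clustering stages the abstract mentions) before a function like associate is applied to the narration.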