How many words is a picture worth? Automatic caption generation for news images
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Hi-index | 0.00 |
In this paper, we propose a method for recognizing human actions and objects and translating them into natural language text. First, 3D environmental map is constructed by accumulating range maps captured from a 3D range sensor mounted on a mobile robot. Then, pose of a person in the scene is estimated by fitting articulated cylindrical model and also object is recognized by matching 3D models. On condition that the person handles some objects, interaction with the object is classified. Finally, using conceptual model representing human actions and related objects, a natural language expression which is most suitable to explain the person's action is generated.