Unsupervised language learning for discovered visual concepts
ACCV'12 Proceedings of the 11th Asian conference on Computer Vision - Volume Part IV
We propose that appearance descriptors derived from the complete animacy of an object over its presence in a scene capture the essence of the object more comprehensively than descriptors that merely encode uncorrelated sets of its instantaneous appearances. While present, an object assumes many poses with differing frequencies, generating multiple modes of varying strength in the appearance feature space. We utilize tracking information to extract the set of all appearances of the object, excluding intervals in which the object is partly or fully occluded by other objects or background entities. This allows fully unsupervised computation of descriptors consisting of time-indexed vectors of shape and Haar feature templates, which are then clustered to obtain appearance modes. These modes lead to the construction of object-animacy models as probability distributions over the space of co-occurring shape and Haar templates. The object models are clustered further, again without supervision, using different spatial clustering algorithms with a Bhattacharyya distance metric between models. Unsupervised categorization results on simple (PETS2000) and complex traffic scenes containing a wide variety of objects show the robust performance of the proposed approach.
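As an illustration of the model-comparison step, the Bhattacharyya distance between two discrete probability distributions can be computed as below. This is a minimal sketch assuming each object-animacy model is represented as a histogram over a common set of shape/Haar template co-occurrences; the function name and representation are illustrative, not the paper's actual implementation:

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete distributions
    defined over the same support (e.g. template co-occurrence bins)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalize each histogram to a probability distribution.
    p = p / p.sum()
    q = q / q.sum()
    # Bhattacharyya coefficient (overlap), in [0, 1]; 1 means identical.
    bc = np.sum(np.sqrt(p * q))
    return -np.log(bc)

# Identical models have distance ~0; disjoint support gives infinity.
model_a = [4.0, 10.0, 6.0]
model_b = [4.0, 10.0, 6.0]
print(bhattacharyya_distance(model_a, model_b))
```

A pairwise matrix of such distances can then be fed to any spatial clustering algorithm that accepts a precomputed metric, which matches the unsupervised categorization step described above.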