In this demo we present a user-friendly latent semantic retrieval and clustering system for personal photos annotated with sparse spontaneous speech tags recorded when the photos are taken. Only 10% of the photos need to be annotated, each with a few words of spontaneous speech covering one or two semantic categories (e.g. what or where), yet all photos can be retrieved effectively with high-level semantic queries in words (e.g. who, what, where, when) and clustered by semantics as well. We use low-level image features to construct relationships among the photos, and train semantic models with Probabilistic Latent Semantic Analysis (PLSA) on fused speech and image features to derive the "topics" of the photos. The sparse speech annotations serve as the user interface to the whole personal photo archive, while unannotated photos are automatically related to them through the fused features and PLSA topics.
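To make the topic-modeling step concrete, the following is a minimal sketch of PLSA fitted by EM on a photo-by-feature count matrix. It is not the demo's actual implementation: the function name `plsa`, the matrix layout, and the use of a single count matrix standing in for the fused speech/image representation (e.g. quantized visual words concatenated with speech-transcript words per photo) are assumptions for illustration.

```python
import numpy as np

def plsa(counts, n_topics, n_iters=100, seed=0):
    """Fit PLSA by EM on a (photos x features) count matrix.

    counts[d, w] is how often feature w (an assumed fused
    speech/image token) occurs for photo d.
    Returns P(z|d) (photos x topics) and P(w|z) (topics x features).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_feats = counts.shape
    # Random normalized initialization of the two factor distributions.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)          # P(z|d)
    p_w_z = rng.random((n_topics, n_feats))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)          # P(w|z)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w), shape (docs, topics, feats).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight by observed counts n(d,w).
        weighted = counts[:, None, :] * joint
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```

Retrieval and clustering can then operate in topic space: photos are compared by their P(z|d) vectors, so an unannotated photo whose image features co-occur with an annotated photo's features inherits related topics even though it carries no speech tag itself.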