Probabilistic models of text and images

Authors:
David Meir Blei; Michael I. Jordan
Affiliations:
University of California, Berkeley; University of California, Berkeley
Venue:
Probabilistic models of text and images
Year:
2004

Abstract

Managing large and growing collections of information is a central goal of modern computer science. Data repositories of texts, images, sounds, and genetic information have become widely accessible, thus necessitating good methods of retrieval, organization, and exploration. In this thesis, we describe a suite of probabilistic models of information collections for which the above problems can be cast as statistical queries. We use directed graphical models as a flexible, modular framework for describing appropriate modeling assumptions about the data. Fast approximate posterior inference algorithms based on variational methods free us from having to specify tractable models, and further allow us to take the Bayesian perspective, even in the face of large datasets. With this framework in hand, we describe latent Dirichlet allocation (LDA), a graphical model particularly suited to analyzing text collections. LDA posits a finite index of hidden topics which describe the underlying documents. New documents are situated into the collection via approximate posterior inference of their associated index terms. Extensions to LDA can index a set of images, or multimedia collections of interrelated text and images. Finally, we describe nonparametric Bayesian methods for relaxing the assumption of a fixed number of topics, and develop models based on the natural assumption that the size of the index can grow with the collection. This idea is extended to trees, and to models which represent the hidden structure and content of a topic hierarchy that underlies a collection.
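
The generative assumptions summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation, and it omits the variational inference step entirely. The function names (`sample_dirichlet`, `generate_document`, `crp_num_tables`) and the hyperparameter values are hypothetical, chosen only for illustration. The first part draws a document from a fixed, finite set of topics, as LDA posits. The second part uses the Chinese restaurant process, one standard nonparametric construction that the abstract does not name explicitly, to show how the number of index entries can grow with the size of the collection.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(topics, alpha, length, rng):
    """Generate one document under the LDA generative process.

    topics: list of K word distributions, each a list over the vocabulary
    alpha:  Dirichlet parameter of length K (assumed value, not from the source)
    """
    theta = sample_dirichlet(alpha, rng)        # per-document topic proportions
    assignments, words = [], []
    for _ in range(length):
        z = sample_categorical(theta, rng)      # hidden topic for this word
        w = sample_categorical(topics[z], rng)  # word drawn from that topic
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


def crp_num_tables(n_customers, gamma, rng):
    """Chinese restaurant process: count occupied tables after n customers.

    Each table plays the role of a topic; gamma controls how readily
    a new one is created.
    """
    counts = []
    for _ in range(n_customers):
        # An existing table k is chosen in proportion to counts[k],
        # and a new table in proportion to gamma.
        weights = counts + [gamma]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
    return len(counts)


if __name__ == "__main__":
    rng = random.Random(0)
    vocab_size, num_topics = 10, 3
    topics = [sample_dirichlet([0.5] * vocab_size, rng) for _ in range(num_topics)]
    theta, z, w = generate_document(topics, [0.5] * num_topics, 20, rng)
    print("topic proportions:", [round(t, 3) for t in theta])
    print("word ids:", w)
    for n in (10, 100, 1000):
        print(n, "items ->", crp_num_tables(n, 1.0, rng), "topics")
```

In the finite model the number of topics is fixed in advance by the length of `alpha`. In the process sketched by `crp_num_tables`, the number of occupied tables is not fixed and tends to increase slowly as more items arrive, which is the sense in which the index can grow with the collection.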