Organizing the OCA: learning faceted subjects from a library of digital books

  • Authors:
  • David Mimno; Andrew McCallum

  • Affiliations:
  • University of Massachusetts Amherst, Amherst, MA; University of Massachusetts Amherst, Amherst, MA

  • Venue:
  • Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
  • Year:
  • 2007

Abstract

Large-scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic models provide a complementary function. These models can identify semantically coherent "topics" that are easily recognizable and meaningful to humans, but they have been too computationally intensive to run on library-scale corpora. This paper presents DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions. This model is simultaneously better able to represent observed properties of text and more scalable to extremely large text collections. We train individual topic models for each book based on the co-occurrence of words within pages. We then cluster topics across books. The resulting topical clusters can be interpreted as subject facets, allowing readers to browse the topics of a collection quickly, find relevant books using topically expanded keyword searches, and explore topical relationships between books. We demonstrate this method by finding topics on a corpus of 1.49 billion words from 42,000 books in less than 20 hours, and it could easily scale well beyond this.
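
The abstract outlines a two-stage pipeline: fit a topic model per book, treating pages as documents, then cluster the resulting topics across books into subject facets. The sketch below is a minimal illustration of that pipeline, not the paper's method: it substitutes standard LDA (scikit-learn) for the paper's DCM-LDA, and the book data, topic counts, and facet counts are hypothetical placeholders.

```python
# Illustrative two-stage pipeline: per-book topic models over pages,
# then clustering of topic-word vectors across books into "facets".
# Standard LDA stands in for DCM-LDA; all parameters are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans


def facet_topics(books, n_topics_per_book=10, n_facets=100):
    """books: list of books, each a list of page strings.
    Returns a facet label for every per-book topic plus the shared vocabulary."""
    # Build one shared vocabulary so topic-word vectors from different
    # books live in the same space and can be clustered together.
    vectorizer = CountVectorizer(stop_words="english", max_features=50000)
    vectorizer.fit(page for book in books for page in book)

    topic_vectors = []
    for pages in books:
        counts = vectorizer.transform(pages)  # each page is a "document"
        lda = LatentDirichletAllocation(
            n_components=n_topics_per_book, random_state=0
        )
        lda.fit(counts)
        # Row-normalize to obtain topic-word probability vectors.
        topic_vectors.append(
            lda.components_ / lda.components_.sum(axis=1, keepdims=True)
        )
    topic_vectors = np.vstack(topic_vectors)

    # Cluster topics across books; each cluster plays the role of a subject facet.
    facets = KMeans(n_clusters=n_facets, random_state=0).fit_predict(topic_vectors)
    return facets, vectorizer.get_feature_names_out()
```

Because each book's model is fit independently, the per-book stage parallelizes trivially across a corpus, which is consistent with the abstract's emphasis on scaling to tens of thousands of books.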