Large-scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic models provide a complementary function. These models can identify semantically coherent "topics" that are easily recognizable and meaningful to humans, but they have been too computationally intensive to run on library-scale corpora. This paper presents DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions. This model is both better able to represent observed properties of text and more scalable to extremely large text collections. We train an individual topic model for each book based on the co-occurrence of words within pages. We then cluster topics across books. The resulting topical clusters can be interpreted as subject facets, allowing readers to browse the topics of a collection quickly, find relevant books using topically expanded keyword searches, and explore topical relationships between books. We demonstrate this method by finding topics in a corpus of 1.49 billion words from 42,000 books in less than 20 hours, and the method could easily scale well beyond this.
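As a rough illustration of the Dirichlet Compound Multinomial (DCM) distribution the model builds on, the sketch below computes the standard DCM log-probability of one particular word sequence with given word counts. It is not the DCM-LDA model itself; the function name `dcm_log_prob`, the two-word vocabulary, and the parameter values are assumptions chosen for illustration only.

```python
from math import exp, lgamma


def dcm_log_prob(counts, alpha):
    """Log-probability of one specific word sequence under a DCM.

    counts[w] is how often word w occurs in the sequence and alpha[w] is
    its Dirichlet parameter. The multinomial coefficient is omitted, so
    this scores a single ordering of the words, not the bag of words.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    total_alpha = sum(alpha)
    total_count = sum(counts)
    log_p = lgamma(total_alpha) - lgamma(total_alpha + total_count)
    for x, a in zip(counts, alpha):
        log_p += lgamma(a + x) - lgamma(a)
    return log_p


# Hypothetical two-word vocabulary with small parameters (assumed values).
alpha = [0.1, 0.1]
repeated = exp(dcm_log_prob([2, 0], alpha))  # the same word twice
mixed = exp(dcm_log_prob([1, 1], alpha))     # two different words, one ordering
print(repeated, mixed)
```

With small parameters, seeing a word once makes seeing it again much more likely, so the repeated sequence scores higher than the mixed one. A plain multinomial with equal word probabilities would score both sequences identically.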