Evaluating topic models for digital libraries

  • Authors:
  • David Newman, University of California, Irvine, Irvine, CA, USA
  • Youn Noh, Yale University, New Haven, CT, USA
  • Edmund Talley, NIH, Washington, DC, USA
  • Sarvnaz Karimi, NICTA, Melbourne, Australia
  • Timothy Baldwin, University of Melbourne, Melbourne, Australia

  • Venue:
  • Proceedings of the 10th Annual Joint Conference on Digital Libraries (JCDL 2010)
  • Year:
  • 2010

Abstract

Topic models could have a significant impact on how users find and discover content in digital libraries and search interfaces, through their ability to automatically learn and apply subject tags to every item in a collection, and to dynamically create virtual collections on the fly. However, much remains to be done to tap this potential and to empirically evaluate the true value of a given topic model to humans. In this work, we sketch out sub-tasks that we suggest pave the way towards this goal, and present methods for assessing the coherence and interpretability of topics learned by topic models. Our large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections spanning a wide range of genres and domains. We show that a scoring model based on the pointwise mutual information of word pairs, estimated using Wikipedia, Google and MEDLINE as external data sources, performs well at predicting human scores. This automated scoring of topics is an important first step towards integrating topic modeling into digital libraries.
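To make the scoring approach concrete, the following is a minimal Python sketch of PMI-based topic scoring, under simplifying assumptions: co-occurrence is counted at the document level over an in-memory reference corpus, and a topic's score is the mean PMI over all pairs of its top words. The function name `topic_pmi_score` and the `epsilon` smoothing term are illustrative choices, not taken from the paper, whose exact counting and aggregation details may differ.

```python
import itertools
import math
from collections import Counter


def topic_pmi_score(topic_words, corpus_docs, epsilon=1e-12):
    """Score a topic's coherence as the mean pointwise mutual
    information (PMI) over all pairs of its top words, with
    probabilities estimated from an external reference corpus
    (e.g. Wikipedia or MEDLINE, as in the paper).

    topic_words: top-N words of one learned topic
    corpus_docs: external corpus, one list of tokens per document
    epsilon:     illustrative smoothing term to avoid log(0)
    """
    n_docs = len(corpus_docs)
    word_counts = Counter()  # number of documents containing each topic word
    pair_counts = Counter()  # number of documents containing both words of a pair
    vocab = set(topic_words)
    for doc in corpus_docs:
        present = sorted(vocab & set(doc))
        word_counts.update(present)
        pair_counts.update(itertools.combinations(present, 2))

    pmis = []
    for w1, w2 in itertools.combinations(sorted(set(topic_words)), 2):
        p_joint = pair_counts[(w1, w2)] / n_docs
        p_w1 = word_counts[w1] / n_docs
        p_w2 = word_counts[w2] / n_docs
        # PMI(w1, w2) = log P(w1, w2) / (P(w1) P(w2))
        pmis.append(math.log((p_joint + epsilon) / (p_w1 * p_w2 + epsilon)))
    return sum(pmis) / len(pmis)


# Toy usage: a biomedical topic scored against a tiny stand-in corpus.
topic = ["gene", "dna", "genome", "sequence", "protein"]
corpus = [
    ["gene", "dna", "sequence", "analysis"],
    ["genome", "gene", "protein", "expression"],
    ["protein", "sequence", "dna"],
    ["library", "user", "search"],
]
print(topic_pmi_score(topic, corpus))
```

Document-level counting is only the simplest estimator; in practice one would swap in counts from the chosen external source (e.g. page co-occurrence for Wikipedia, hit counts for Google), which changes only how `word_counts` and `pair_counts` are gathered.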