Subject metadata enrichment using statistical topic models

Authors:
David Newman; Kat Hagedorn; Chaitanya Chemudugunta; Padhraic Smyth
Affiliations:
UC Irvine, Irvine, CA; University of Michigan, Ann Arbor, MI; UC Irvine, Irvine, CA; UC Irvine, Irvine, CA
Venue:
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Year:
2007

Abstract

Creating a collection of metadata records from disparate and diverse sources often results in uneven, unreliable, and variable-quality subject metadata. Having uniform, consistent, and enriched subject metadata allows users to more easily discover material, browse the collection, and limit keyword search results by subject. We demonstrate how statistical topic models are useful for subject metadata enrichment. We describe some of the challenges of metadata enrichment on a huge scale (10 million metadata records from 700 repositories in the OAIster Digital Library) when the metadata is highly heterogeneous (metadata about images and text, and both cultural heritage material and scientific literature). We show how to improve the quality of the enriched metadata, using both manual and statistical modeling techniques. Finally, we discuss some of the challenges of the production environment, and demonstrate the value of the enriched metadata in a prototype portal.