Topic model methods for automatically identifying out-of-scope resources

Authors:
Steven Bethard;Soumya Ghosh;James H. Martin;Tamara Sumner
Affiliations:
Stanford University, Stanford, CA, USA;University of Colorado, Boulder, CO, USA;University of Colorado, Boulder, CO, USA;University of Colorado, Boulder, CO, USA
Venue:
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Year:
2009

Abstract

Recent years have seen the rise of subject-themed digital libraries, such as the NSDL pathways and the Digital Library for Earth System Education (DLESE). These libraries often need to manually verify that contributed resources cover topics that fit within the theme of the library. We show that such scope judgments can be automated using a combination of text classification techniques and topic modeling. Our models address two significant challenges in making scope judgments: only a small number of out-of-scope resources are typically available, and the topic distinctions required for digital libraries are much more subtle than those in classic text classification problems. To meet these challenges, our models combine support vector machine learners optimized for different performance metrics with semantic topics induced by unsupervised statistical topic models. Our best model is able to distinguish resources that belong in DLESE from resources that do not with an accuracy of around 70%. We see these models as the first steps towards increasing the scalability of digital libraries and dramatically reducing the workload required to maintain them.
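
The abstract notes that out-of-scope resources are scarce and that the learners are optimized for different performance metrics. The following is a minimal sketch of why the choice of metric matters under that kind of class imbalance. It is not the authors' evaluation code: the counts, the `Confusion` type, and the choice of F1 as the alternative metric are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary scope judgment; 'positive' means out-of-scope."""
    tp: int  # out-of-scope resources correctly flagged
    fp: int  # in-scope resources wrongly flagged
    fn: int  # out-of-scope resources missed
    tn: int  # in-scope resources correctly accepted


def accuracy(c: Confusion) -> float:
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: Confusion) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    actual = c.tp + c.fn
    return c.tp / actual if actual else 0.0


def f1(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical collection (not from the paper): 90 in-scope resources
# and 10 out-of-scope resources.
accept_everything = Confusion(tp=0, fp=0, fn=10, tn=90)  # never flags anything
flags_some = Confusion(tp=6, fp=12, fn=4, tn=78)         # flags 18 resources

for name, c in [("accept_everything", accept_everything),
                ("flags_some", flags_some)]:
    print(f"{name}: accuracy={accuracy(c):.2f} f1={f1(c):.2f}")
```

In this hypothetical setup, the classifier that never flags anything has the higher accuracy but an F1 of zero on the out-of-scope class, while the classifier that actually finds out-of-scope resources scores lower on accuracy. A learner tuned only for accuracy can therefore look good while being useless for scope judgments, which is one reason to consider learners tuned for other metrics.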