Term weighting schemes for Latent Dirichlet Allocation

Authors:
Andrew T. Wilson; Peter A. Chew
Affiliations:
Sandia National Laboratories, Albuquerque, NM; Moss Adams LLP, Albuquerque, NM
Venue:
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Year:
2010

Citing 11
Cited 4

Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Information Retrieval
Introduction to Modern Information Retrieval
Latent Dirichlet allocation. The Journal of Machine Learning Research
Unsupervised learning of the morphology of a natural language. Computational Linguistics
Multiple organism gene finding by collapsed Gibbs sampling. RECOMB '04 Proceedings of the eighth annual international conference on Research in computational molecular biology
Clustering the tagged web. Proceedings of the Second ACM International Conference on Web Search and Data Mining
Mining multilingual topics from Wikipedia. Proceedings of the 18th international conference on World Wide Web
Explicit versus latent concept models for cross-language information retrieval. IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
An information-theoretic, vector-space-model approach to cross-language information retrieval. Natural Language Engineering
Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory

DTTM: a discriminative temporal topic model for facial expression recognition. ISVC'11 Proceedings of the 7th international conference on Advances in visual computing - Volume Part I
Unsupervised part-of-speech tagging in noisy and esoteric domains with a syntactic-semantic Bayesian HMM. Proceedings of the Workshop on Semantic Analysis in Social Media
Incorporating popularity in topic models for social network analysis. Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
A study on document retrieval system based on visualization to manage OCR documents. HCI'13 Proceedings of the 15th international conference on Human-Computer Interaction: interaction modalities and techniques - Volume Part IV

Abstract

Many implementations of Latent Dirichlet Allocation (LDA), including those described in Blei et al. (2003), rely at some point on the removal of stopwords, words which are assumed to contribute little to the meaning of the text. This step is considered necessary because otherwise high-frequency words tend to end up scattered across many of the latent topics without much rhyme or reason. We show, however, that the 'problem' of high-frequency words can be dealt with more elegantly, and in a way that to our knowledge has not been considered in LDA, through the use of appropriate weighting schemes comparable to those sometimes used in Latent Semantic Indexing (LSI). Our proposed weighting methods not only make theoretical sense, but can also be shown to improve precision significantly on a non-trivial cross-language retrieval task.
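
To make the idea of weighting high-frequency words concrete, here is a minimal sketch of log-entropy weighting, a scheme commonly used with LSI. The abstract does not say which weighting scheme the authors adopt or how the weights enter LDA inference, so the choice of log-entropy, the function name `log_entropy_weights`, and the toy counts are all assumptions made for illustration only.

```python
import math

def log_entropy_weights(counts):
    """Apply log-entropy weighting to a term-by-document count table.

    counts: dict mapping each term to a list of per-document counts
    (hypothetical toy input; all lists must have the same length).
    """
    weighted = {}
    for term, row in counts.items():
        n_docs = len(row)
        if n_docs < 2:
            raise ValueError("need at least two documents")
        total = sum(row)
        if total == 0:
            weighted[term] = [0.0] * n_docs
            continue
        # Entropy of the term's distribution over documents.
        entropy = -sum((c / total) * math.log(c / total) for c in row if c > 0)
        # Global weight: 1 for a term confined to one document,
        # 0 for a term spread evenly over all documents.
        global_weight = 1.0 - entropy / math.log(n_docs)
        # Local weight: log-dampened raw count.
        weighted[term] = [global_weight * math.log(1 + c) for c in row]
    return weighted

# Toy example: a stopword-like term versus a topical term.
counts = {
    "the": [10, 10, 10, 10],
    "dirichlet": [0, 6, 0, 0],
}
weights = log_entropy_weights(counts)
for term, row in weights.items():
    print(term, [round(w, 3) for w in row])
```

In this toy table the evenly spread word "the" receives a weight of approximately zero in every document, while "dirichlet" keeps its full log-dampened count, so no explicit stopword list is needed to suppress the frequent word.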