Latent Dirichlet Allocation (LDA) is a document-level language model. In general, LDA employs a symmetric Dirichlet distribution as the prior over the topic-word distributions to implement model smoothing. In this paper, we propose a data-driven smoothing strategy in which probability mass is allocated from smoothing data to the latent variables through the intrinsic inference procedure of LDA. In this way, the arbitrariness of choosing priors for the latent variables of the multi-level graphical model is overcome. Following this data-driven strategy, two concrete methods, Laplacian smoothing and Jelinek-Mercer smoothing, are applied to the LDA model. Evaluations on several text categorization collections show that data-driven smoothing can significantly improve performance on both balanced and unbalanced corpora.
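The two smoothing methods named in the abstract can be illustrated outside of LDA as plain unigram estimators. The sketch below is only a minimal illustration of the underlying formulas, Laplacian (add-one) smoothing and Jelinek-Mercer interpolation with a collection model; the toy corpus and the interpolation weight `lam` are illustrative assumptions, not settings from the paper:

```python
from collections import Counter

def laplace_smooth(counts, vocab):
    # Add-one (Laplacian) smoothing: every vocabulary word gets count + 1,
    # so unseen words receive non-zero probability.
    total = sum(counts.values()) + len(vocab)
    return {w: (counts.get(w, 0) + 1) / total for w in vocab}

def jelinek_mercer(counts, collection_probs, lam=0.7):
    # Jelinek-Mercer smoothing: linearly interpolate the document's
    # maximum-likelihood estimate with a background collection model.
    # lam is an illustrative weight, not a value taken from the paper.
    total = sum(counts.values())
    return {w: lam * counts.get(w, 0) / total
               + (1 - lam) * collection_probs[w]
            for w in collection_probs}

# Toy document and collection (hypothetical data for illustration only)
doc = "the topic model smooths the topic".split()
coll = "the a topic model word smooths distribution the".split()
vocab = set(doc) | set(coll)

doc_counts = Counter(doc)
coll_counts = Counter(coll)
coll_probs = {w: coll_counts.get(w, 0) / len(coll) for w in vocab}

p_lap = laplace_smooth(doc_counts, vocab)
p_jm = jelinek_mercer(doc_counts, coll_probs)
```

Both estimators yield proper distributions over the vocabulary (their probabilities sum to 1), and both assign non-zero mass to words absent from the document, which is the purpose of smoothing in the first place.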