Towards noise-resilient document modeling

Authors:
Tao Yang;Dongwon Lee
Affiliations:
The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Citing 3
Cited 1

Latent dirichlet allocation

The Journal of Machine Learning Research
Probabilistic models for topic learning from images and captions in online biomedical literatures

Proceedings of the 18th ACM conference on Information and knowledge management
Evaluating models of latent document semantics in the presence of OCR errors

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

On handling textual errors in latent document modeling

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

We introduce a generative probabilistic document model based on latent Dirichlet allocation (LDA), to deal with textual errors in the document collection. Our model is inspired by the fact that most large-scale text data are machine-generated and thus inevitably contain many types of noise. The new model, termed as TE-LDA, is developed from the traditional LDA by adding a switch variable into the term generation process in order to tackle the issue of noisy text data. Through extensive experiments, the efficacy of our proposed model is validated using both real and synthetic data sets.