Using LDA to detect semantically incoherent documents

Authors:
Hemant Misra; Olivier Cappé; François Yvon
Affiliations:
LTCI/CNRS and TELECOM ParisTech; LTCI/CNRS and TELECOM ParisTech; Univ Paris-Sud and LIMSI-CNRS
Venue:
CoNLL '08 Proceedings of the Twelfth Conference on Computational Natural Language Learning
Year:
2008

Abstract

Detecting the semantic coherence of a document is a challenging task with several applications, including text segmentation and categorization. This paper attempts to distinguish a 'semantically coherent' true document from a 'randomly generated' false document through topic detection in the framework of latent Dirichlet allocation (LDA). The premise is that a true document contains only a few topics, whereas a false document is made up of many; the entropy of the topic distribution should therefore be lower for a true document than for a false one. This hypothesis is tested on several false document sets generated by various methods and is found to be useful for fake content detection applications.
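
The entropy criterion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the example topic proportions, and the threshold value are all hypothetical, and the topic distribution is assumed to have been inferred already by some LDA model.

```python
import math


def topic_entropy(theta):
    """Shannon entropy (in nats) of a document's topic distribution.

    `theta` is a sequence of non-negative topic weights; it is
    renormalised here so that it sums to one.
    """
    total = sum(theta)
    if total <= 0:
        raise ValueError("theta must contain positive mass")
    probs = [p / total for p in theta]
    # Zero-probability topics contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def looks_incoherent(theta, threshold):
    """Flag a document whose topic entropy exceeds a chosen threshold.

    The threshold is an assumption of this sketch; in practice it would
    have to be tuned on held-out true and false documents.
    """
    return topic_entropy(theta) > threshold


# Hypothetical topic proportions over four topics.
coherent = [0.90, 0.05, 0.03, 0.02]    # mass concentrated on one topic
incoherent = [0.25, 0.25, 0.25, 0.25]  # mass spread over all topics

print(topic_entropy(coherent) < topic_entropy(incoherent))
print(looks_incoherent(coherent, threshold=1.0))
print(looks_incoherent(incoherent, threshold=1.0))
```

A document dominated by one topic yields a low entropy, while a document whose mass is spread evenly over all topics reaches the maximum value, the logarithm of the number of topics.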