Using LDA to detect semantically incoherent documents

  • Authors:
  • Hemant Misra;Olivier Cappé;François Yvon

  • Affiliations:
  • LTCI/CNRS and TELECOM ParisTech;LTCI/CNRS and TELECOM ParisTech;Univ Paris-Sud and LIMSI-CNRS

  • Venue:
  • CoNLL '08 Proceedings of the Twelfth Conference on Computational Natural Language Learning
  • Year:
  • 2008

Abstract

Detecting the semantic coherence of a document is a challenging task with several applications, such as text segmentation and categorization. This paper is an attempt to distinguish between a 'semantically coherent' true document and a 'randomly generated' false document through topic detection in the framework of latent Dirichlet allocation (LDA). Based on the premise that a true document contains only a few topics and a false document is made up of many topics, it is asserted that the entropy of the topic distribution will be lower for a true document than for a false document. This hypothesis is tested on several false document sets generated by various methods and is found to be useful for fake content detection applications.
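
The entropy criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the per-document topic distribution has already been inferred by some LDA model, and the function name topic_entropy and the two example distributions are hypothetical values chosen only to show the direction of the effect.

```python
import math


def topic_entropy(theta):
    """Shannon entropy (in nats) of a topic distribution.

    `theta` is assumed to hold non-negative topic weights for one
    document; it is renormalised so that it sums to 1.
    """
    total = sum(theta)
    if total <= 0:
        raise ValueError("theta must contain positive mass")
    probs = [p / total for p in theta]
    # Zero-probability topics contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical distributions over 5 topics (illustrative values only).
coherent_theta = [0.85, 0.10, 0.03, 0.01, 0.01]  # mass on a few topics
mixed_theta = [0.2, 0.2, 0.2, 0.2, 0.2]          # mass spread over many topics

coherent_entropy = topic_entropy(coherent_theta)
mixed_entropy = topic_entropy(mixed_theta)

print(f"coherent: {coherent_entropy:.3f} nats")
print(f"mixed:    {mixed_entropy:.3f} nats")
```

Under the paper's premise, a document whose entropy is closer to the uniform maximum, log K for K topics, would be the more likely candidate for a false document. Any decision threshold would have to be tuned on data and is not specified here.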