Using Kullback-Leibler distance for text categorization

  • Authors:
  • Brigitte Bigi

  • Affiliations:
  • CLIPS-IMAG Laboratory, UMR CNRS, Grenoble cedex 9, France

  • Venue:
  • ECIR'03 Proceedings of the 25th European conference on IR research
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

A system that performs text categorization aims to assign appropriate categories from a predefined classification scheme to incoming documents. These assignments might be used for varied purposes such as filtering, or retrieval. This paper introduces a new effective model for text categorization with great corpus (more or less 1 million documents). Text categorization is performed using the Kullback-Leibler distance between the probability distribution of the document to classify and the probability distribution of each category. Using the same representation of categories, experiments show a significant improvement when the above mentioned method is used. KLD method achieve substantial improvements over the tfidf performing method.