Text Categorization Using Compression Models

  • Authors:
  • Eibe Frank, Chang Chui, Ian H. Witten

  • Venue:
  • DCC '00 Proceedings of the Conference on Data Compression
  • Year:
  • 2000

Abstract

Text categorization is the assignment of natural language texts to predefined categories based on their content. It has often been observed that compression seems to provide a very promising approach to categorization: the overall compression of an article with respect to different models can be compared to see which one it fits most closely. Such a scheme has several potential advantages because it does not require any pre-processing of the input text.

We have performed extensive experiments on the use of PPM compression models for categorization using the standard Reuters-21578 dataset. We obtained some encouraging results on two-category situations, and the results on the general problem seem reasonably impressive, in one case outstanding. However, we find that PPM does not compete with the published state of the art in the use of machine learning for text categorization. It produces inferior results because it is insensitive to subtle differences between articles that belong to a category and those that do not.

We do not believe our results are specific to PPM. If the occurrence of a single word determines whether an article belongs to a category or not (and it often does), any compression scheme will likely fail to classify the article correctly. Machine learning schemes fare better because they automatically eliminate irrelevant features and concentrate on the most discriminating ones.
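The idea of comparing an article's compression under different category models can be illustrated with a minimal sketch. This is not the authors' PPM implementation: it substitutes Python's standard `zlib` (LZ77-based) for PPM, and the tiny category corpora are invented for illustration. The principle is the same: an article costs fewer extra bytes when compressed together with text from the category it belongs to.

```python
import zlib

def compressed_size(data: bytes) -> int:
    # Size of the zlib-compressed data at maximum compression level.
    return len(zlib.compress(data, 9))

def classify(article: str, category_corpora: dict) -> str:
    # For each category, measure how many extra bytes the article adds
    # when appended to that category's training text and compressed.
    # The best-fitting category yields the smallest increase, because
    # the compressor can reuse patterns already seen in the corpus.
    art = article.encode()
    best_label, best_cost = None, float("inf")
    for label, corpus in category_corpora.items():
        base = corpus.encode()
        cost = compressed_size(base + art) - compressed_size(base)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label

# Hypothetical toy corpora standing in for per-category training data.
corpora = {
    "sports": "football goal match team player score win league season",
    "finance": "stock market shares price trading profit earnings bank",
}
print(classify("the team scored a late goal to win the match", corpora))
```

Note how this sketch also exhibits the weakness the abstract describes: if membership hinges on a single discriminating word, the byte savings from that one word are easily swamped by the rest of the article, whereas a learned classifier can weight that word heavily.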