Text Categorization Using Compression Models

  • Authors:
  • Eibe Frank, Chang Chui, Ian H. Witten

  • Venue:
  • DCC '00 Proceedings of the Conference on Data Compression
  • Year:
  • 2000

Abstract

Text categorization is the assignment of natural language texts to predefined categories based on their content. It has often been observed that compression seems to provide a very promising approach to categorization: the overall compression of an article with respect to different models can be compared to see which one it fits most closely. Such a scheme has several potential advantages because it does not require any pre-processing of the input text.

We have performed extensive experiments on the use of PPM compression models for categorization using the standard Reuters-21578 dataset. We obtained some encouraging results on two-category situations, and the results on the general problem seem reasonably impressive, in one case outstanding. However, we find that PPM does not compete with the published state of the art in the use of machine learning for text categorization. It produces inferior results because it is insensitive to subtle differences between articles that belong to a category and those that do not.

We do not believe our results are specific to PPM. If the occurrence of a single word determines whether an article belongs to a category or not (and it often does), any compression scheme will likely fail to classify the article correctly. Machine learning schemes fare better because they automatically eliminate irrelevant features and concentrate on the most discriminating ones.
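The idea of comparing an article's compression under different category models can be illustrated with a minimal sketch. This is not the authors' PPM implementation: it substitutes Python's standard `zlib` (LZ77-based) for PPM, and the tiny category corpora are invented for illustration. The principle is the same: an article costs fewer extra bytes when compressed together with text from the category it belongs to.

```python
import zlib

def compressed_size(data: bytes) -> int:
    # Size of the zlib-compressed data at maximum compression level.
    return len(zlib.compress(data, 9))

def classify(article: str, category_corpora: dict) -> str:
    # For each category, measure how many extra bytes the article adds
    # when appended to that category's training text and compressed.
    # The best-fitting category yields the smallest increase, because
    # the compressor can reuse patterns already seen in the corpus.
    art = article.encode()
    best_label, best_cost = None, float("inf")
    for label, corpus in category_corpora.items():
        base = corpus.encode()
        cost = compressed_size(base + art) - compressed_size(base)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label

# Hypothetical toy corpora standing in for per-category training data.
corpora = {
    "sports": "football goal match team player score win league season",
    "finance": "stock market shares price trading profit earnings bank",
}
print(classify("the team scored a late goal to win the match", corpora))
```

Note how this sketch also exhibits the weakness the abstract describes: if membership hinges on a single discriminating word, the byte savings from that one word are easily swamped by the rest of the article, whereas a learned classifier can weight that word heavily.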