Text Categorization Using Compression Models
DCC '00 Proceedings of the Conference on Data Compression
Text categorization is the assignment of natural language texts to predefined categories based on their content. It has often been observed that compression seems to provide a very promising approach to categorization. The overall compression of an article with respect to different models can be compared to see which one it fits most closely. Such a scheme has several potential advantages because it does not require any pre-processing of the input text.

We have performed extensive experiments on the use of PPM compression models for categorization using the standard Reuters-21578 dataset. We obtained some encouraging results on two-category situations, and the results on the general problem seem reasonably impressive, in one case outstanding. However, we find that PPM does not compete with the published state of the art in the use of machine learning for text categorization. It produces inferior results because it is insensitive to subtle differences between articles that belong to a category and those that do not.

We do not believe our results are specific to PPM. If the occurrence of a single word determines whether an article belongs to a category or not (and it often does), any compression scheme will likely fail to classify the article correctly. Machine learning schemes fare better because they automatically eliminate irrelevant features and concentrate on the most discriminating ones.
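
The comparison described in the abstract, compressing an article with respect to each category's model and picking the closest fit, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes zlib (DEFLATE) as a stand-in for PPM, because the Python standard library has no PPM coder, and it approximates "compression with respect to a model" by the number of extra bytes needed when the article is appended to a category's training text. The function names and the toy training strings are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (zlib stands in for PPM here)."""
    return len(zlib.compress(data, 9))


def extra_bytes(training_text: str, article: str) -> int:
    """Approximate cost of coding the article given a category's training text.

    The article is appended to the training text, and the growth in
    compressed size is taken as the article's cost under that category.
    """
    base = training_text.encode("utf-8")
    both = base + b" " + article.encode("utf-8")
    return compressed_size(both) - compressed_size(base)


def categorize(article: str, training: dict) -> str:
    """Return the category whose training text makes the article cheapest to code."""
    return min(training, key=lambda category: extra_bytes(training[category], article))


if __name__ == "__main__":
    # Toy training texts, invented for illustration only (not Reuters-21578 data).
    training = {
        "grain": "wheat corn harvest tonnes export shipment of grain crop farmers " * 20,
        "money": "bank interest rates currency dollar central bank monetary policy " * 20,
    }
    article = "wheat corn harvest tonnes export shipment of grain crop farmers"
    for category, text in training.items():
        print(category, extra_bytes(text, article))
    print("predicted:", categorize(article, training))
```

The sketch also hints at the weakness the abstract reports: the score sums the coding cost over the whole article, so one decisive word contributes only a few bytes and is easily outweighed by the rest of the text.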