On compression-based text classification

Authors:
Yuval Marton;Ning Wu;Lisa Hellerstein
Affiliations:
Department of Linguistics, University of Maryland, College Park, MD;Department of Computer and Information Science, Polytechnic University, Brooklyn, NY;Department of Computer and Information Science, Polytechnic University, Brooklyn, NY
Venue:
ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research
Year:
2005

Abstract

Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
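
To make the general idea concrete, here is a minimal sketch of one common form of compression-based classification: a document is assigned to the class whose training text lets it be compressed most cheaply. This is an illustration only, not the specific methods evaluated in the paper. It assumes Python's standard `zlib` module (DEFLATE) as a stand-in compressor, and the names `compressed_size`, `classify`, and `TRAINING`, as well as the toy training texts, are hypothetical.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs for the UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: Mapping[str, str]) -> str:
    """Return the label whose training text absorbs `document` most cheaply.

    The cost for a label is the growth in compressed size when the document
    is appended to that label's training text. The compressor works on raw
    characters, so no tokenization or other preprocessing is applied.
    """
    def extra_cost(label: str) -> int:
        corpus = training[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=extra_cost)


# Toy training data (assumed for illustration; real corpora would be far larger).
TRAINING = {
    "sports": (
        "The striker scored a late goal and the team won the match in the "
        "final minute. The goalkeeper saved a penalty in the second half."
    ),
    "cooking": (
        "Simmer the onions in butter, then add garlic and chopped tomatoes. "
        "Season the sauce with salt and fresh basil before serving."
    ),
}

if __name__ == "__main__":
    print(classify("the team won the match in the final minute", TRAINING))
    print(classify("add garlic and chopped tomatoes", TRAINING))
```

Because the compressor matches character sequences rather than words, a repeated phrase that spans several words, punctuation included, lowers the cost as a single unit. A general-purpose compressor such as zlib has a limited match window, so this sketch is only suitable for small training texts.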