Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word stems, and features spanning more than one word. However, these methods also have drawbacks, such as slow running time, and they are not all equally effective. We present the results of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Some of these experiments specifically test whether the ability to capture non-word (including super-word) features makes character-based text compression methods classify more accurately.
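To make the general idea concrete, the following is a minimal sketch of one common form of compression-based classification, not a reconstruction of any method evaluated here. It assumes Python's standard `zlib` (DEFLATE) compressor as a stand-in for whichever compressors the experiments use, and the names `compressed_size`, `classify`, and the toy `training` corpora are hypothetical. A document is assigned to the class whose training text lets it be compressed most cheaply, working directly on raw characters with no tokenization.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: Mapping[str, str]) -> str:
    """Return the label whose training text makes the document cheapest to compress.

    Assumption: the cost of a document under a class is approximated by the
    growth in compressed size when the document is appended to that class's
    training text. This is an illustrative choice, not the source's method.
    """
    def extra_bytes(label: str) -> int:
        corpus = training[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=extra_bytes)


# Hypothetical toy corpora; real use would concatenate many documents per class.
training = {
    "sports": "the team won the match after scoring two goals in the final minutes of the game",
    "finance": "the bank raised interest rates and the stock market fell sharply on monday",
}

label = classify("scoring two goals in the final minutes", training)
```

Because the compressor sees characters rather than words, repeated punctuation, word fragments, and multi-word phrases in the training text can all reduce the cost of a matching document. This approximation is also slow, since every class requires a separate compression pass per document.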