Exploiting extremely rare features in text categorization

Authors:
Péter Schönhofen;András A. Benczúr
Affiliations:
Informatics Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest;Informatics Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest
Venue:
ECML'06 Proceedings of the 17th European conference on Machine Learning
Year:
2006



Abstract

One of the first steps in document classification, clustering and many other information retrieval tasks is to discard words that occur only a few times in the corpus, on the assumption that they contribute little to the bag-of-words representation. However, as we will show, rare n-grams and other similar features indicate surprisingly well whether two documents belong to the same category, and thus can aid classification. In our experiments over four corpora, we found that, with the size of the training set held constant, 5-25% of the test set can be classified essentially for free based on rare features alone, without any loss of accuracy; accuracy even improves by 0.6-1.6%.
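
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure, which the abstract does not spell out; it only assumes that a test document sharing a rare word n-gram with training documents is likely to belong to their category. The function names, the bigram choice (`n=2`), the document-frequency cap (`max_df=3`) and the majority vote are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(text, n=2):
    """Return the set of word n-grams in a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def classify_by_rare_features(train, test, n=2, max_df=3):
    """Label test documents that share a rare n-gram with training documents.

    train: list of (text, label) pairs; test: list of texts.
    Returns {test_index: label} for the documents that could be labelled;
    all other documents are left for a conventional classifier.
    Hypothetical helper, not taken from the paper.
    """
    train_feats = [ngrams(text, n) for text, _ in train]
    test_feats = [ngrams(text, n) for text in test]

    # Document frequency over the whole collection (training + test).
    df = Counter()
    for feats in train_feats + test_feats:
        df.update(feats)

    # Index rare features (2 <= df <= max_df) by the training labels they occur with.
    labels_by_feature = defaultdict(list)
    for feats, (_, label) in zip(train_feats, train):
        for f in feats:
            if 2 <= df[f] <= max_df:
                labels_by_feature[f].append(label)

    # Each rare feature shared with a training document casts a vote for its label.
    predictions = {}
    for i, feats in enumerate(test_feats):
        votes = Counter()
        for f in feats:
            votes.update(labels_by_feature.get(f, ()))
        if votes:
            predictions[i] = votes.most_common(1)[0][0]
    return predictions


train = [
    ("the zyxel modem firmware crashed again", "tech"),
    ("stock markets rallied after the rate cut", "finance"),
]
test = [
    "my zyxel modem keeps dropping the line",
    "a quiet walk in the park",
]
print(classify_by_rare_features(train, test))  # {0: 'tech'}
```

In this toy run only the first test document shares a rare bigram ("zyxel modem") with a training document, so it alone is labelled; the second would be passed on to a regular classifier.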