One of the first steps in document classification, clustering, and many other information retrieval tasks is to discard words that occur only a few times in the corpus, on the assumption that they contribute little to the bag-of-words representation. However, as we will show, rare n-grams and other similar features indicate surprisingly well whether two documents belong to the same category, and can thus aid classification. In experiments over four corpora, we found that, with the size of the training set held constant, 5-25% of the test set can be classified essentially for free using rare features, with no loss of accuracy and in fact an improvement of 0.6-1.6%.
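
To make the idea concrete, the following is a minimal sketch, not the procedure used in the experiments, of how rare shared features might be used to label test documents. It assumes word bigrams as features, treats a feature as rare when it occurs in between 2 and `max_df` documents of the combined training and test sets, and lets each test document take a majority vote over the labels of the training documents with which it shares a rare feature. The function names, the `max_df` threshold, and the voting rule are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def word_ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def classify_by_rare_features(train_docs, train_labels, test_docs, n=2, max_df=2):
    """Label test documents by voting over rare n-grams shared with training documents.

    Assumption: a feature is "rare" if its document frequency over the
    combined training and test sets is between 2 and max_df inclusive.
    Returns one label per test document, or None when no rare feature
    links it to a training document (such documents would be left to a
    conventional classifier).
    """
    feature_sets = [word_ngrams(d, n) for d in list(train_docs) + list(test_docs)]
    doc_freq = Counter(f for features in feature_sets for f in features)

    # Map each rare feature to the labels of the training documents containing it.
    rare_to_labels = defaultdict(list)
    for features, label in zip(feature_sets[:len(train_docs)], train_labels):
        for f in features:
            if 2 <= doc_freq[f] <= max_df:
                rare_to_labels[f].append(label)

    predictions = []
    for features in feature_sets[len(train_docs):]:
        votes = Counter(
            label for f in features for label in rare_to_labels.get(f, ())
        )
        predictions.append(votes.most_common(1)[0][0] if votes else None)
    return predictions


if __name__ == "__main__":
    train = ["the quantum flux capacitor failed", "stock markets rallied on friday"]
    labels = ["sci", "fin"]
    test = ["engineers replaced the flux capacitor today", "nothing in common here"]
    print(classify_by_rare_features(train, labels, test))
```

In this toy example the first test document shares the bigram "flux capacitor" with exactly one training document and inherits its label, while the second shares no rare feature and is left unlabeled. Only the documents left unlabeled would then need to go through a full classifier.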