Uncertainty and term selection in text categorization

Authors:
Charles M. E. E. Peters; Cornelis H. A. Koster
Affiliations:
Department of Computer Science, University of Nijmegen, Nijmegen, The Netherlands; Department of Computer Science, University of Nijmegen, Nijmegen, The Netherlands
Venue:
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
Year:
2003


Abstract

This paper discusses the notion of Uncertainty, which has a prominent place in the theory and experimental practice of modern Physics. It argues that the awareness of Uncertainty may also be of tremendous importance to the field of Information Retrieval, and in particular Text Categorization.

As an application of Uncertainty in Text Categorization, a new criterion for Term Selection is described, which is based on the Uncertainty in Term Frequency across categories. This criterion makes it possible to distinguish between low-quality (or "noisy") and high-quality ("stiff") terms.

We describe an experiment investigating the effect of eliminating noisy and stiff terms in the context of text classification. In the experiment we applied the Rocchio and Winnow classification algorithms to a collection of newspaper items, a mono-classified subset of the well-known Reuters-21578 corpus.

This investigation shows that both the local elimination of noisy terms and the global elimination of stiff terms can be used for Term Selection in Text Categorization.