Uncertainty-Based Noise Reduction and Term Selection in Text Categorization

Authors:
C. Peters; Cornelis H. A. Koster
Venue:
Proceedings of the 24th BCS-IRSG European Colloquium on IR Research: Advances in Information Retrieval
Year:
2002

Abstract

This paper introduces a new criterion for term selection, based on the notion of Uncertainty. Term selection according to this criterion is performed by eliminating noisy terms on a class-by-class basis, rather than by selecting the most significant ones. Uncertainty-based term selection (UC) is compared to a number of other criteria, such as Information Gain (IG), simplified χ² (SX), Term Frequency (TF) and Document Frequency (DF), in a Text Categorization setting. Experiments on data sets with different properties (Reuters-21578, patent abstracts and patent applications) and with two different algorithms (Winnow and Rocchio) show that UC-based term selection is not the most aggressive term selection criterion, but that its effect is quite stable across data sets and algorithms. This makes it a good candidate for a general "install-and-forget" term selection mechanism. We also describe and evaluate a hybrid Term Selection technique, which first applies UC to eliminate noisy terms and then uses another criterion to select the best terms.
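
The two-stage shape of the hybrid technique can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract does not define the Uncertainty measure, so the noise filter is passed in as a hypothetical caller-supplied predicate `is_noisy`. The second stage uses Document Frequency within one class as the ranking criterion, chosen here only because it is simple to compute. The function names `document_frequency` and `hybrid_select` are likewise assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence


def document_frequency(docs: Sequence[Sequence[str]]) -> Counter:
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so a term counts once per document
    return df


def hybrid_select(
    class_docs: Sequence[Sequence[str]],
    is_noisy: Callable[[str], bool],
    k: int,
) -> List[str]:
    """Two-stage term selection for the documents of a single class.

    Stage 1 drops every term the (hypothetical) is_noisy predicate rejects.
    Stage 2 keeps the k surviving terms with the highest document frequency,
    breaking ties alphabetically so the result is deterministic.
    """
    df = document_frequency(class_docs)
    kept = {term: n for term, n in df.items() if not is_noisy(term)}
    ranked = sorted(kept, key=lambda term: (-kept[term], term))
    return ranked[:k]


# Toy usage: three tokenized documents from one class, with a stand-in
# noise predicate that rejects a single function word.
docs = [
    ["oil", "price", "the"],
    ["oil", "the", "barrel"],
    ["price", "the", "oil"],
]
selected = hybrid_select(docs, is_noisy=lambda t: t == "the", k=2)
```

Because the filter runs separately for each class, a term rejected as noisy for one class can still be kept for another, which matches the class-by-class elimination described in the abstract.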