This paper introduces a new criterion for term selection, based on the notion of Uncertainty. Term selection according to this criterion eliminates noisy terms on a class-by-class basis, rather than selecting the most significant ones. Uncertainty-based term selection (UC) is compared with a number of other criteria, namely Information Gain (IG), simplified χ² (SX), Term Frequency (TF) and Document Frequency (DF), in a Text Categorization setting. Experiments on data sets with different properties (Reuters-21578, patent abstracts and patent applications) and with two different algorithms (Winnow and Rocchio) show that UC is not the most aggressive term selection criterion, but that its effect is quite stable across data sets and algorithms. This makes it a good candidate for a general "install-and-forget" term selection mechanism. We also describe and evaluate a hybrid Term Selection technique, which first applies UC to eliminate noisy terms and then uses another criterion to select the best of the remaining terms.
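The two-stage structure of the hybrid technique can be illustrated with a minimal Python sketch. The abstract does not define the Uncertainty criterion, so the first stage below uses a simple document-frequency cutoff purely as a stand-in for UC; the second stage ranks the surviving terms by Information Gain for a single binary class. The function names, the `min_df` and `k` parameters, and the toy corpus are all assumptions made for illustration, not part of the paper.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    return df


def entropy(pos, total):
    """Binary entropy in bits of a class with `pos` members among `total` documents."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(docs, labels, term):
    """IG of a term for one binary class: H(C) minus H(C given term presence)."""
    n = len(docs)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    h_class = entropy(sum(labels), n)
    h_cond = (len(with_term) / n) * entropy(sum(with_term), len(with_term)) \
        + (len(without) / n) * entropy(sum(without), len(without))
    return h_class - h_cond


def hybrid_select(docs, labels, min_df, k):
    """Stage 1: drop rare terms (hypothetical stand-in for the UC noise filter).
    Stage 2: keep the k surviving terms with the highest Information Gain."""
    df = document_frequency(docs)
    survivors = [t for t, count in df.items() if count >= min_df]
    # Ties in IG are broken alphabetically so the result is deterministic.
    ranked = sorted(survivors, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Toy corpus (invented): label 1 marks the class of interest, 0 everything else.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "price"],
    ["patent", "claim", "device"],
    ["patent", "device", "price"],
]
labels = [1, 1, 0, 0]

selected = hybrid_select(docs, labels, min_df=2, k=2)
print(selected)
```

In a multi-class setting, the same procedure would be repeated once per class, in line with the class-by-class elimination described above. Swapping the second-stage scoring function for another criterion (for example DF or TF) changes only the sort key.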