Exploiting Category Information and Document Information to Improve Term Weighting for Text Categorization

Authors:
Jingyang Li;Maosong Sun
Affiliations:
National Laboratory of Intelligent Technology and Systems, Dept. of Computer Sci. & Tech., Tsinghua University, Beijing 100084, China;National Laboratory of Intelligent Technology and Systems, Dept. of Computer Sci. & Tech., Tsinghua University, Beijing 100084, China
Venue:
CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2007

Abstract

Traditional tfidf-like term weighting schemes use a rough statistic, idf, as the term weighting factor. It does not exploit the category information (category labels on documents) or the intra-document information (the relative importance of a given term to a given document that contains it) available in the training data for a text categorization task. We present here a more elaborate nonparametric probabilistic model that makes use of this information in the term weighting phase. idf is theoretically proved to be a rough approximation of this new term weighting factor. This work is preliminary and mainly aims to inspire further study on the exploitation of this information, but it already provides a moderate performance boost on three popular document collections.
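
For orientation, the baseline that the abstract calls rough can be sketched as follows. This is a minimal illustration of one common tf-idf variant (raw term frequency times log(N/df)), not the authors' proposed model, which the abstract does not specify. The function names and the toy documents are assumptions made for this example only.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return idf(t) = log(N / df(t)) for every term in the tokenized docs."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(doc, idf):
    """Weight each term of one document by raw term frequency times idf."""
    tf = Counter(doc)
    # Terms never seen when idf was computed get weight 0.0 (an assumption).
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


# Hypothetical toy collection; category labels are deliberately absent,
# because plain idf never looks at them.
docs = [
    ["wheat", "price", "rise"],
    ["oil", "price", "fall"],
    ["wheat", "harvest", "wheat"],
]
idf = idf_weights(docs)
vec = tfidf_vector(docs[2], idf)
```

In this sketch idf depends only on how many documents contain a term. It uses neither the category labels nor how important the term is within each document, which are the two sources of information the abstract proposes to exploit.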