Text classification based on the bias of word frequency over categories

Authors:
Makoto Suzuki
Affiliations:
Department of Information Science, Shonan Institute of Technology, Fujisawa, Kanagawa, Japan
Venue:
AIA'06 Proceedings of the 24th IASTED international conference on Artificial intelligence and applications
Year:
2006

Abstract

In automatic text classification, for example when classifying newspaper articles into predefined categories such as politics and sports, the crucial step is selecting appropriate keywords. Traditional classification methods based on the vector space model emphasize frequent words, so low-frequency words tend to be disregarded. However, low-frequency words are often effective for classification. For instance, technical terms appear in specific categories, so their frequencies are generally low even though they are effective keywords. In this paper, we propose two text classification methods, the NDF method and the accumulation method, both based on the bias of the word frequency distribution over categories. Our experiments show that the accumulation method outperforms a traditional method based on the vector space model.
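
The notion of frequency bias over categories can be illustrated with a small sketch. This is not the NDF method or the accumulation method, neither of which is defined in the abstract. The toy corpus, the function names `word_category_counts`, `category_shares`, and `bias`, and the use of the largest per-category share as a bias measure are all assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def word_category_counts(docs):
    """Count how often each word occurs in each category.

    docs is a list of (category, tokens) pairs (a hypothetical input format).
    """
    counts = defaultdict(Counter)
    for category, tokens in docs:
        for token in tokens:
            counts[token][category] += 1
    return counts


def category_shares(counts):
    """Turn raw counts into the fraction of a word's occurrences per category."""
    shares = {}
    for word, per_category in counts.items():
        total = sum(per_category.values())
        shares[word] = {c: n / total for c, n in per_category.items()}
    return shares


def bias(shares, word):
    """Illustrative bias measure (an assumption): the largest category share."""
    return max(shares[word].values())


# Toy corpus: "the" is frequent everywhere, "offside" is rare but category-specific.
docs = [
    ("politics", ["the", "election", "the", "vote", "the"]),
    ("sports", ["the", "match", "the", "offside", "the"]),
]

counts = word_category_counts(docs)
shares = category_shares(counts)

for word in ("the", "offside"):
    total = sum(counts[word].values())
    print(word, "total frequency:", total, "bias:", bias(shares, word))
```

In this toy corpus the frequent word "the" is spread evenly across both categories, while the rare word "offside" occurs in only one. A purely frequency-driven weighting would favor the former, whereas a measure of bias over categories favors the latter.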