Term-weighting approaches in automatic text retrieval
Information Processing and Management: an International Journal
Automated learning of decision rules for text categorization
ACM Transactions on Information Systems (TOIS)
The nature of statistical learning theory
The nature of statistical learning theory
Using corpus statistics to remove redundant words in text categorization
Journal of the American Society for Information Science
Self-organizing maps
A re-examination of text categorization methods
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
The feature quantity: an information theoretic perspective of Tfidf-like measures
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
An introduction to support Vector Machines: and other kernel-based learning methods
An introduction to support Vector Machines: and other kernel-based learning methods
An Evaluation of Statistical Approaches to Text Categorization
Information Retrieval
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Pattern Classification (2nd Edition)
Pattern Classification (2nd Edition)
Kernel Methods for Pattern Analysis
Kernel Methods for Pattern Analysis
Hi-index | 0.00 |
In automatic text classification, for example, for classifying newspaper articles into predefined categories such as politics and sports, the crucial step is how to select appropriate keywords. With traditional classification methods based on the vector space model, frequent words are emphasized and therefore low-frequency words tend to be disregarded. However, there often exist low-frequency words that are effective for classification. For instance, technical terms appear in specific categories so their frequencies are generally low, even though they are effective keywords. In this paper, we propose two text classification methods, namely, NDF method and accumulation method, that are based on the bias of word frequency distribution over categories. Our experiments show that our accumulation method outperforms a traditional method based on the vector space model.