Text categorization with class-based and corpus-based keyword selection

Authors:
Arzucan Özgür;Levent Özgür;Tunga Güngör
Affiliations:
Department of Computer Engineering, Boğaziçi University, Bebek, İstanbul, Turkey;Department of Computer Engineering, Boğaziçi University, Bebek, İstanbul, Turkey;Department of Computer Engineering, Boğaziçi University, Bebek, İstanbul, Turkey
Venue:
ISCIS'05 Proceedings of the 20th international conference on Computer and Information Sciences
Year:
2005

Abstract

In this paper, we examine the use of keywords in text categorization with SVM. Contrary to the common belief, we show that using keywords instead of all words yields better performance in terms of both accuracy and time. Unlike previous studies, which focus on keyword selection metrics, we compare two approaches to keyword selection. In the corpus-based approach, a single set of keywords is selected for all classes. In the class-based approach, a distinct set of keywords is selected for each class. We perform the experiments on the standard Reuters-21578 dataset, with both boolean and tf-idf weighting. Our results show that although tf-idf weighting performs better, boolean weighting can be used where time and space resources are limited. The corpus-based approach with 2000 keywords performs best. However, for a small number of keywords, the class-based approach outperforms the corpus-based approach with the same number of keywords.
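
The difference between the two selection approaches, and between the two weighting schemes, can be illustrated with a minimal Python sketch. The abstract does not state which selection metric is used, so the sketch assumes that terms are ranked by their summed tf-idf weight; the function names, the toy documents, and the class labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def idf_table(docs):
    """Inverse document frequency of every term, computed over the whole corpus."""
    n = len(docs)
    df = Counter(t for tokens, _ in docs for t in set(tokens))
    return {t: math.log(n / df[t]) for t in df}

def top_keywords(docs, idf, k):
    """Rank terms by summed tf-idf over `docs` (assumed metric) and keep the top k."""
    scores = Counter()
    for tokens, _ in docs:
        for term, tf in Counter(tokens).items():
            scores[term] += tf * idf[term]
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def corpus_based(docs, k):
    """One keyword set shared by all classes."""
    return top_keywords(docs, idf_table(docs), k)

def class_based(docs, k):
    """A separate keyword set for each class."""
    idf = idf_table(docs)
    by_class = defaultdict(list)
    for tokens, label in docs:
        by_class[label].append((tokens, label))
    return {label: top_keywords(members, idf, k)
            for label, members in by_class.items()}

def vectorize(tokens, keywords, idf, boolean=False):
    """Represent a document over the selected keywords with boolean or tf-idf weights."""
    tf = Counter(tokens)
    if boolean:
        return [1.0 if tf[t] else 0.0 for t in keywords]
    return [tf[t] * idf[t] for t in keywords]

# Toy corpus (hypothetical): (tokens, class label) pairs.
docs = [
    ("wheat grain export wheat".split(), "grain"),
    ("grain harvest wheat".split(), "grain"),
    ("oil crude barrel oil".split(), "crude"),
    ("crude oil price".split(), "crude"),
    ("oil export price".split(), "crude"),
]

if __name__ == "__main__":
    print("corpus-based:", corpus_based(docs, 2))
    print("class-based: ", class_based(docs, 2))
    idf = idf_table(docs)
    print("boolean:", vectorize(docs[2][0], corpus_based(docs, 2), idf, boolean=True))
    print("tf-idf: ", vectorize(docs[2][0], corpus_based(docs, 2), idf))
```

With the class-based approach, one binary classifier per class would be trained on vectors built from that class's own keyword list, whereas the corpus-based approach shares a single vector space across all classes. The SVM training step is omitted here.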