Classification of skewed and homogenous document corpora with class-based and corpus-based keywords

Authors:
Arzucan Özgür;Tunga Güngör
Affiliations:
Boǧaziçi University, Computer Engineering Department, Bebek, Istanbul, Turkey;Boǧaziçi University, Computer Engineering Department, Bebek, Istanbul, Turkey
Venue:
KI'06 Proceedings of the 29th annual German conference on Artificial intelligence
Year:
2006

Abstract

In this paper, we examine the performance of two keyword selection policies over standard document corpora with varying properties. In the corpus-based policy, a single set of keywords is selected globally for all classes; in the class-based policy, a distinct set of keywords is selected locally for each class. We use SVM as the learning method and perform experiments with boolean and tf-idf weighting. Contrary to common belief, we show that using keywords instead of all words generally yields better performance, and that tf-idf weighting does not always outperform boolean weighting. Our results reveal that the corpus-based approach performs better for a large number of keywords, while the class-based approach performs better for a small number of keywords. In skewed datasets, class-based keyword selection consistently outperforms the corpus-based approach in terms of macro-averaged F-measure. In homogenous datasets, the performances of the class-based and corpus-based approaches are similar except for a small number of keywords.
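
The difference between the two selection policies can be illustrated with a small sketch. It is not the authors' implementation: the scoring criterion (document frequency), the toy corpus, and all function names are assumptions made for illustration only, since the abstract does not state how terms are ranked. The sketch selects keywords once for the whole corpus and once per class, then builds boolean feature vectors from the selected keywords.

```python
from collections import Counter


def term_scores(docs):
    """Score each term by the number of documents containing it.

    Document frequency is an assumed stand-in for the ranking
    criterion, which the abstract does not specify.
    """
    scores = Counter()
    for tokens in docs:
        scores.update(set(tokens))
    return scores


def top_k(scores, k):
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def corpus_based_keywords(docs, k):
    """One global keyword set shared by all classes."""
    return top_k(term_scores(docs), k)


def class_based_keywords(docs, labels, k):
    """A separate keyword set for each class, ranked within that class only."""
    by_class = {}
    for tokens, label in zip(docs, labels):
        by_class.setdefault(label, []).append(tokens)
    return {label: top_k(term_scores(class_docs), k)
            for label, class_docs in by_class.items()}


def boolean_vector(tokens, keywords):
    """Boolean weighting: 1 if the keyword occurs in the document, else 0."""
    present = set(tokens)
    return [1 if kw in present else 0 for kw in keywords]


# Hypothetical skewed toy corpus: three "crude" documents, one "grain" document.
docs = [
    ["oil", "price", "market"],
    ["oil", "export", "market"],
    ["oil", "price", "barrel"],
    ["wheat", "grain", "market"],
]
labels = ["crude", "crude", "crude", "grain"]

global_keywords = corpus_based_keywords(docs, 2)
local_keywords = class_based_keywords(docs, labels, 2)

grain_doc_global = boolean_vector(docs[3], global_keywords)
grain_doc_local = boolean_vector(docs[3], local_keywords["grain"])
```

In this toy corpus the global keyword set is dominated by terms from the majority class, so the single "grain" document is represented only through the shared term "market". The per-class set for "grain" includes a term specific to that class. This mirrors, in a very simplified way, why a class-based policy might help rare classes in skewed corpora when few keywords are kept; it says nothing about the magnitude of the effect reported in the paper.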