An efficient feature ranking measure for text categorization

Authors:
Songbo Tan; Yuefen Wang; Xueqi Cheng
Affiliations:
Chinese Academy of Sciences, China; Chinese Academy of Geological Sciences, China; Chinese Academy of Sciences, China
Venue:
Proceedings of the 2008 ACM Symposium on Applied Computing
Year:
2008

Abstract

A major obstacle to text classifier performance is the extremely high dimensionality of text data. To reduce the dimensionality, a number of approaches based on rough-set theory have been proposed. However, these approaches often suffer from two problems: they cannot directly handle continuous text features, and they often incur considerable running time. To address the first problem, we extend the discernibility matrix so that it can work with continuous features. To cut running time, we construct the discernibility matrix from class centroids rather than individual training examples, which reduces the time complexity from O(T²W) to O(K²W), where T denotes the number of training examples, K the number of training classes, and W the size of the vocabulary. The experimental results indicate that the proposed method not only yields much higher accuracy than Information Gain when fewer than 6000 features are selected, but also requires much less CPU time than Information Gain.
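
The centroid idea in the abstract can be illustrated with a minimal Python sketch. The abstract gives no scoring formula, so the score used here (the summed absolute difference between class centroids, per feature) is an assumption standing in for the paper's extended discernibility matrix, not the authors' actual measure. The function names `class_centroids` and `rank_features` and the toy data are hypothetical. The sketch only aims to show why working on K centroids instead of T examples leaves a pairwise loop of O(K²W) cost.

```python
from collections import defaultdict
from itertools import combinations


def class_centroids(docs, labels):
    """Average the feature vectors of each class: one centroid per class."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(docs, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for j, value in enumerate(vec):
            sums[label][j] += value
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def rank_features(docs, labels):
    """Rank features by how strongly they separate pairs of class centroids.

    Assumption: the score is the summed absolute centroid difference over
    all class pairs, a continuous stand-in for a discernibility-matrix entry.
    """
    centroids = class_centroids(docs, labels)
    width = len(docs[0])
    scores = [0.0] * width
    # K(K-1)/2 class pairs times W features: O(K^2 W) work after centroids.
    for a, b in combinations(sorted(centroids), 2):
        for j in range(width):
            scores[j] += abs(centroids[a][j] - centroids[b][j])
    # Highest-scoring (most discriminative) features first.
    return sorted(range(width), key=lambda j: scores[j], reverse=True)


# Toy data: 4 documents with 3 continuous term weights each (made-up values).
docs = [
    [0.9, 0.1, 0.5],
    [0.8, 0.2, 0.5],
    [0.1, 0.3, 0.5],
    [0.2, 0.2, 0.5],
]
labels = ["sports", "sports", "finance", "finance"]
ranking = rank_features(docs, labels)
print(ranking)  # [0, 1, 2]
```

In this toy case feature 0 differs most between the two class centroids and feature 2 is identical in both, so it ranks last. Computing the centroids takes one pass over the T training examples, after which the pairwise comparison depends only on K and W.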