A major obstacle to high-performance text classification is the extremely high dimensionality of text data. A number of approaches based on rough-set theory have been proposed to reduce this dimensionality, but they typically suffer from two problems: they cannot directly handle continuous text features, and they often incur considerable running time. To address the first issue, we extend the discernibility matrix so that it can work with continuous features. To cut down running time, we construct the discernibility matrix from class centroids rather than from individual examples, which reduces the time complexity from O(T²W) to O(K²W), where T denotes the number of training examples, K the number of classes, and W the vocabulary size. The experimental results indicate that the proposed method not only yields much higher accuracy than Information Gain when the number of selected features is smaller than 6000, but also requires much less CPU time.
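The abstract does not spell out how the discernibility matrix is extended to continuous features, so the following is only a plausible sketch of the centroid-based idea: a feature is taken to discern two classes when its centroid values differ by more than a threshold, and features are ranked by how many class pairs they discern. The function name `centroid_discernibility_selection`, the threshold `eps`, and the counting-based scoring rule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def centroid_discernibility_selection(X, y, eps=0.1, n_features=1000):
    """Select features via a centroid-based discernibility matrix (sketch).

    X : (T, W) array of continuous feature values (e.g. tf-idf weights)
    y : (T,) array of class labels
    eps : assumed threshold deciding when two continuous values "discern"
    Returns the indices of the selected features.
    """
    classes = np.unique(y)
    K, W = len(classes), X.shape[1]

    # One W-dimensional centroid per class, replacing the T training examples.
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])

    # Discernibility over centroid pairs: O(K^2 W) instead of O(T^2 W).
    # Assumed rule: a feature discerns classes i and j if its centroid
    # values differ by more than eps.
    scores = np.zeros(W)
    for i in range(K):
        for j in range(i + 1, K):
            scores += np.abs(centroids[i] - centroids[j]) > eps

    # Keep the features that discern the most class pairs.
    return np.argsort(scores)[::-1][:n_features]

# Toy usage: 4 documents, 5 features, 2 classes.
X = np.array([[0.9, 0.1, 0.0, 0.2, 0.3],
              [0.8, 0.0, 0.1, 0.3, 0.2],
              [0.1, 0.9, 0.8, 0.2, 0.3],
              [0.0, 0.8, 0.9, 0.3, 0.2]])
y = np.array([0, 0, 1, 1])
print(centroid_discernibility_selection(X, y, eps=0.3, n_features=2))
```

The double loop touches each of the W features once per pair of class centroids, i.e. K(K−1)/2 times in total, which matches the O(K²W) cost stated in the abstract.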