Feature selection methods for text classification

Authors:
Anirban Dasgupta;Petros Drineas;Boulos Harb;Vanja Josifovski;Michael W. Mahoney
Affiliations:
Yahoo Research;RPI;U Penn;Yahoo Research;Yahoo Research
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007

Abstract

We consider feature selection for text classification both theoretically and empirically. Our main result is an unsupervised feature selection strategy for which we give worst-case theoretical guarantees on the generalization power of the resultant classification function f̃ with respect to the classification function f obtained when keeping all the features. To the best of our knowledge, this is the first feature selection method with such guarantees. In addition, the analysis leads to insights as to when and why this feature selection strategy will perform well in practice. We then use the TechTC-100, 20-Newsgroups, and Reuters-RCV2 data sets to evaluate empirically the performance of this and two simpler but related feature selection strategies against two commonly used strategies. Our empirical evaluation shows that the strategy with provable performance guarantees performs well in comparison with other commonly used feature selection strategies. In addition, it performs better on certain datasets under very aggressive feature selection.