Integrating feature and instance selection for text classification

Authors:
Dimitris Fragoudis;Dimitris Meretakis;Spiros Likothanassis
Affiliations:
University of Patras, Rio GR-26500, Greece;Zurich Financial Services, Switzerland;University of Patras, Rio GR-26500, Greece
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Abstract

Instance selection and feature selection are two orthogonal methods for reducing the amount and complexity of data. Feature selection aims to reduce the number of redundant features in a dataset, whereas instance selection aims to reduce the number of instances. So far, these two methods have mostly been considered in isolation. In this paper, we present a new algorithm, which we call FIS (Feature and Instance Selection), that targets both problems simultaneously in the context of text classification. Our experiments on the Reuters and 20-Newsgroups datasets show that FIS considerably reduces both the number of features and the number of instances. The accuracy of a range of classifiers, including Naïve Bayes, TAN and LB, improves considerably when using the FIS-preprocessed datasets, matching and exceeding that of Support Vector Machines, which are currently considered to be among the best text classification methods. In all cases the results are much better than those obtained with Mutual Information based feature selection. The training and classification speed of all classifiers is also greatly improved.
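
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of Mutual Information based feature selection over binary term presence. It is not the FIS algorithm, which the abstract does not specify. The function names `mutual_information` and `select_features`, the representation of documents as token sets, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(docs, labels, term):
    """Mutual information (in nats) between presence of `term` and the class label.

    `docs` is a list of token sets and `labels` the parallel list of class
    names; both are illustrative assumptions, not the paper's data format.
    """
    n = len(docs)
    # Joint counts of (term present?, class label) over all documents.
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    term_counts = Counter(term in doc for doc in docs)
    label_counts = Counter(labels)

    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_term = term_counts[present] / n
        p_label = label_counts[label] / n
        mi += p_joint * math.log(p_joint / (p_term * p_label))
    return mi


def select_features(docs, labels, k):
    """Return the k terms with the highest mutual information with the label."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(
        vocabulary,
        key=lambda term: mutual_information(docs, labels, term),
        reverse=True,
    )
    return ranked[:k]


# Toy corpus (hypothetical): two documents per class.
docs = [
    {"oil", "price"},
    {"oil", "barrel"},
    {"wheat", "price"},
    {"wheat", "harvest"},
]
labels = ["crude", "crude", "grain", "grain"]

selected = select_features(docs, labels, k=2)
print(selected)
```

On this toy corpus the two selected terms are "oil" and "wheat", since each perfectly separates the two classes, while "price" occurs equally in both classes and scores zero. A feature selector of this kind shrinks only the vocabulary; the number of training documents is unchanged, which is the gap that instance selection addresses.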