FISA: feature-based instance selection for imbalanced text classification

Authors:
Aixin Sun;Ee-Peng Lim;Boualem Benatallah;Mahbub Hassan
Affiliations:
School of Computer Engineering, Nanyang Technological University, Singapore;School of Computer Engineering, Nanyang Technological University, Singapore;School of Computer Science and Engineering, University of New South Wales, NSW, Australia;School of Computer Science and Engineering, University of New South Wales, NSW, Australia
Venue:
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Year:
2006

Abstract

Support Vector Machine (SVM) classifiers are widely used in text classification tasks, and these tasks often involve imbalanced training data. In this paper, we specifically address the cases where negative training documents significantly outnumber the positive ones. A generic algorithm known as FISA (Feature-based Instance Selection Algorithm) is proposed to select only a subset of negative training documents for training an SVM classifier. With a smaller, carefully selected training set, an SVM classifier can be trained more efficiently while delivering comparable or better classification accuracy. In our experiments on the 20-Newsgroups dataset, using only 35% of the negative training examples and 60% of the learning time, methods based on FISA delivered much better classification accuracy than methods using all negative training documents.
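
The abstract does not spell out how FISA chooses its negative documents, so the following is only a rough sketch of the general idea of feature-based instance selection, not the paper's procedure. It assumes a hypothetical scoring rule (difference in document-frequency rates between positive and negative documents) and a hypothetical helper name, select_negatives; the reduced negative set it returns would then be passed, together with all positive documents, to whatever SVM trainer is in use.

```python
from collections import Counter


def select_negatives(pos_docs, neg_docs, k=3):
    """Keep only negative documents that contain a top-k positive-leaning term.

    Hypothetical heuristic for illustration; NOT the scoring rule from the paper.
    """
    # Document frequency of each term in the positive and negative sets.
    pos_df = Counter(t for d in pos_docs for t in set(d.split()))
    neg_df = Counter(t for d in neg_docs for t in set(d.split()))

    # Assumed score: how much more often a term occurs in positive documents.
    def score(term):
        return pos_df[term] / len(pos_docs) - neg_df[term] / len(neg_docs)

    # Highest score first; ties broken alphabetically for determinism.
    top_terms = sorted(pos_df, key=lambda t: (-score(t), t))[:k]
    selected = set(top_terms)

    # A negative document is kept if it shares at least one selected term.
    kept = [d for d in neg_docs if selected & set(d.split())]
    return top_terms, kept


pos = ["gpu driver crash", "gpu driver error", "driver memory error"]
neg = [
    "gpu price drop",
    "football match result",
    "election result news",
    "driver wins race",
]
top_terms, kept = select_negatives(pos, neg, k=3)
print(top_terms)
print(kept)
```

In this toy run, the two negative documents that share no selected term with the positive class are dropped, which mirrors the abstract's point that a smaller negative set can shorten training. Whether accuracy is preserved depends on the actual selection rule, which this sketch does not reproduce.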