On strategies for imbalanced text classification using SVM: A comparative study

Authors:
Aixin Sun;Ee-Peng Lim;Ying Liu
Affiliations:
School of Computer Engineering, Nanyang Technological University, Singapore;School of Information Systems, Singapore Management University, Singapore;Department of Industrial and Systems Engineering, Hong Kong Polytechnic University, Hong Kong
Venue:
Decision Support Systems
Year:
2009

Abstract

Many real-world text classification tasks involve imbalanced training examples. However, the strategies proposed to address imbalanced classification (e.g., resampling, instance weighting) have not been systematically evaluated in the text domain. In this paper, we conduct a comparative study of the effectiveness of these strategies for imbalanced text classification using the Support Vector Machine (SVM) classifier. We focus on SVM because of the good classification accuracy it has achieved in many text classification tasks. We propose a taxonomy that organizes the proposed strategies according to the training and test phases of a text classification task. Based on the taxonomy, we survey the methods proposed to address imbalanced classification. Among them, 10 commonly used methods were evaluated in our experiments on three benchmark datasets: Reuters-21578, 20-Newsgroups, and WebKB. Using the area under the Precision-Recall curve as the performance measure, our experimental results showed that the best decision surface was often learned by the standard SVM, not coupled with any of the proposed strategies. We believe this negative finding will benefit both researchers and application developers in the area by directing more attention to thresholding strategies.
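
The performance measure named in the abstract, the area under the Precision-Recall curve, can be illustrated with a minimal sketch. The abstract does not say how the curve is interpolated, so the step-wise (average precision) form below is an assumption, and the function names `pr_curve` and `average_precision` as well as the toy labels and scores are hypothetical. The scores stand in for signed distances from an SVM decision surface; sweeping a threshold over them is what a thresholding strategy adjusts.

```python
from typing import List, Sequence, Tuple


def pr_curve(labels: Sequence[int],
             scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (recall, precision) points from sweeping the decision
    threshold from the highest classifier score downwards.

    labels: 1 for the minority (positive) category, 0 otherwise.
    scores: classifier outputs, higher means more likely positive.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")

    # Rank documents by decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Documents with tied scores fall on the same side of any
        # threshold, so they are counted together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((tp / n_pos, tp / (tp + fp)))
        i = j
    return points


def average_precision(labels: Sequence[int],
                      scores: Sequence[float]) -> float:
    """Step-wise area under the PR curve (assumed interpolation):
    the sum of precision times the recall gained at each threshold."""
    area = 0.0
    prev_recall = 0.0
    for recall, precision in pr_curve(labels, scores):
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy imbalanced example: 2 positives among 8 documents.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_area = average_precision(toy_labels, toy_scores)  # about 0.833
```

Because the measure depends only on how the classifier ranks documents, it evaluates the learned decision surface independently of any single threshold, which is consistent with the abstract's separation of surface learning from thresholding.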