Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains

  • Authors:
  • Sareewan Dendamrongvit
  • Miroslav Kubat

  • Affiliations:
  • Department of Electrical & Computer Engineering, University of Miami, Coral Gables, FL (both authors)

  • Venue:
  • PAKDD'09: Proceedings of the 13th Pacific-Asia International Conference on Knowledge Discovery and Data Mining: New Frontiers in Applied Data Mining
  • Year:
  • 2009


Abstract

Text categorization is an important application domain of multi-label classification, where each document can simultaneously belong to more than one class. The most common approach is to handle multi-label examples by inducing a separate binary classifier for each class and then using these classifiers in parallel. What the information-retrieval community has all but ignored, however, is that such classifiers are almost always induced from highly imbalanced training sets. The study reported in this paper shows that addressing this imbalance with majority-class undersampling can indeed improve classification performance as measured by criteria common in text categorization: macro- and micro-averaged precision, recall, and F1. We also show how a slight modification of an older undersampling technique further improves the results.
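To make the setting concrete, the sketch below shows the generic scheme the abstract describes: decomposing a multi-label problem into one binary training set per class, then randomly undersampling the majority (negative) class in each. This is a minimal illustration of plain random undersampling, not the paper's specific modified technique; all function names and the target positive/negative ratio are assumptions for the example.

```python
import random

def undersample_majority(X, y, ratio=1.0, seed=0):
    """Randomly drop negative examples so that the number of
    negatives is at most ratio * (number of positives).
    Plain random undersampling; an illustrative sketch, not the
    paper's modified method."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    rng = random.Random(seed)
    keep_neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    idx = sorted(pos + keep_neg)
    return [X[i] for i in idx], [y[i] for i in idx]

def binary_training_sets(X, multilabels, classes):
    """One-vs-rest decomposition: for each class, build a binary
    training set (positive = document carries that label) and
    undersample its majority class."""
    sets = {}
    for c in classes:
        y = [1 if c in labels else 0 for labels in multilabels]
        sets[c] = undersample_majority(X, y)
    return sets
```

In this scheme each per-class training set would then be used to induce an independent binary classifier, and the classifiers are applied in parallel at prediction time, assigning a document every label whose classifier fires.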