C4.5: programs for machine learning. Machine Learning.
Inductive learning algorithms and representations for text categorization. Proceedings of the seventh international conference on Information and knowledge management.
Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning - Special issue on applications of machine learning and the knowledge discovery process.
MetaCost: a general method for making classifiers cost-sensitive. KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining.
Learning and making decisions when costs and probabilities are both unknown. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining.
The Case against Accuracy Estimation for Comparing Induction Algorithms. ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning.
Feature Selection for Unbalanced Class Distribution and Naive Bayes. ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning.
Tree Induction for Probability-Based Ranking. Machine Learning.
Cost-Sensitive Learning by Cost-Proportionate Example Weighting. ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining.
Naive Bayes vs decision trees in intrusion detection systems. Proceedings of the 2004 ACM symposium on Applied computing.
Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets.
A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets.
Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice & Experience - Grid Performance.
Wrapper-based computation and evaluation of sampling methods for imbalanced datasets. UBDM '05 Proceedings of the 1st international workshop on Utility-based data mining.
Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem. IEEE Transactions on Knowledge and Data Engineering.
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems).
SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research.
Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research.
The foundations of cost-sensitive learning. IJCAI'01 Proceedings of the 17th international joint conference on Artificial intelligence - Volume 2.
Ensembles of classifiers from spatially disjoint data. MCS'05 Proceedings of the 6th international conference on Multiple Classifier Systems.
Guest editorial: special issue on utility-based data mining. Data Mining and Knowledge Discovery.
Learning Decision Trees for Unbalanced Data. ECML PKDD '08 Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I.
Knowledge discovery from imbalanced and noisy data. Data & Knowledge Engineering.
Improving software-quality predictions with data sampling and boosting. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans.
Analyzing PETs on imbalanced datasets when training and testing class distributions differ. PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining.
IEEE Transactions on Neural Networks.
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining.
RAMOBoost: ranked minority oversampling in boosting. IEEE Transactions on Neural Networks.
Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms.
Classifying severely imbalanced data. Canadian AI'11 Proceedings of the 24th Canadian conference on Advances in artificial intelligence.
Evolutionary-based selection of generalized instances for imbalanced classification. Knowledge-Based Systems.
Ensembles of decision trees for imbalanced data. MCS'11 Proceedings of the 10th international conference on Multiple classifier systems.
Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery.
Expert Systems with Applications: An International Journal.
Generating diverse ensembles to counter the problem of class imbalance. PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part II.
MCPR'12 Proceedings of the 4th Mexican conference on Pattern Recognition.
Evaluation of sampling methods for learning from imbalanced data. ICIC'13 Proceedings of the 9th international conference on Intelligent Computing Theories.
Editorial: Parameter-free classification in multi-class imbalanced data sets. Data & Knowledge Engineering.
Learning from imbalanced data sets presents a difficult problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority-class accuracy. However, sampling methods face a common but important criticism: how does one automatically discover the proper amount and type of sampling? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set by optimizing evaluation functions such as the f-measure, Area Under the ROC Curve (AUROC), cost, cost curves, and the cost-dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios with unknown or changing cost matrices. We also compared the performance of the wrapper approach against cost-sensitive learning methods, MetaCost and the Cost-Sensitive Classifiers, and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, we obtained the lowest cost per test example that we are aware of for the KDD-99 Cup intrusion detection data set.
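The wrapper idea in the abstract can be sketched in a few lines: search over candidate re-sampling amounts and keep the one whose resulting model scores best under the chosen evaluation function. The sketch below is illustrative only, not the authors' implementation: the function names, the grid of percentages, and the simple duplication-based oversampler are assumptions (the paper considers methods such as SMOTE and undersampling, and evaluation functions such as the f-measure, AUROC, and cost).

```python
import random


def oversample_minority(X, y, minority_label, pct):
    """Duplicate randomly chosen minority-class examples until the
    minority count grows by pct percent. A crude stand-in for
    SMOTE-style oversampling, used here only to make the wrapper
    loop concrete."""
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    n_extra = int(len(minority) * pct / 100.0)
    extra = random.choices(minority, k=n_extra) if n_extra else []
    X_res = list(X) + [x for x, _ in extra]
    y_res = list(y) + [t for _, t in extra]
    return X_res, y_res


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; one of the evaluation
    functions the wrapper can optimize."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def wrapper_search(candidate_pcts, evaluate):
    """Wrapper loop: try each candidate re-sampling amount, score the
    resulting model with the supplied evaluation function (assumed to
    train and validate internally, e.g. via cross-validation), and
    return the best amount and its score."""
    best_pct, best_score = None, float("-inf")
    for pct in candidate_pcts:
        score = evaluate(pct)
        if score > best_score:
            best_pct, best_score = pct, score
    return best_pct, best_score
```

In practice `evaluate` would re-sample the training folds by `pct`, fit a classifier, and return the held-out f-measure, AUROC, or negated cost; swapping the evaluation function is what lets the same wrapper target different cost environments.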