A study of the behavior of several methods for balancing machine learning training data

  • Authors:
  • Gustavo E. A. P. A. Batista, Ronaldo C. Prati, Maria Carolina Monard

  • Affiliations:
  • Instituto de Ciências Matemáticas e de Computação, São Carlos - SP, Brazil (all authors)

  • Venue:
  • ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
  • Year:
  • 2004

Abstract

Several aspects may influence the performance achieved by a learning system. One of these aspects is class imbalance, in which the examples of one class in the training data heavily outnumber the examples of the other class. In this situation, common in real-world data describing an infrequent but important event, the learning system may have difficulty learning the concept associated with the minority class. In this work we perform a broad experimental evaluation of ten methods for dealing with the class imbalance problem, three of them proposed by the authors, on thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. Two of our proposed methods deal with these conditions directly, combining a known over-sampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods, as measured by the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Two of our proposed methods, SMOTE + Tomek and SMOTE + ENN, produced very good results for data sets with a small number of positive examples. Moreover, random over-sampling, a very simple over-sampling method, is competitive with more complex over-sampling methods. Since the over-sampling methods provided very good performance, we also measured the syntactic complexity of the decision trees induced from the over-sampled data. Our results show that these trees are usually more complex than those induced from the original data. Among the investigated over-sampling methods, random over-sampling usually produced the smallest increase in the mean number of induced rules, and SMOTE + ENN the smallest increase in the mean number of conditions per rule.
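
The combined methods the abstract describes have present-day equivalents in the imbalanced-learn library (`SMOTETomek` and `SMOTEENN` in `imblearn.combine`). Below is a minimal sketch of the kind of comparison the paper reports, assuming scikit-learn and imbalanced-learn are installed; the synthetic data set, random seeds, and leaf-count complexity proxy are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch: compare over-sampling strategies by AUC and tree complexity,
# in the spirit of the paper's evaluation. The data and parameters here
# are illustrative, not the paper's thirteen UCI data sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.combine import SMOTETomek, SMOTEENN

# Imbalanced toy problem: roughly 5% positive (minority) class.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

samplers = {
    "original data": None,
    "random over-sampling": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "SMOTE + Tomek": SMOTETomek(random_state=42),
    "SMOTE + ENN": SMOTEENN(random_state=42),
}

for name, sampler in samplers.items():
    if sampler is None:
        X_res, y_res = X_train, y_train
    else:
        # Resample only the training split; the test set keeps the
        # original class distribution so the AUC comparison is fair.
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    tree = DecisionTreeClassifier(random_state=42).fit(X_res, y_res)
    auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
    # Leaf count serves as a rough proxy for the "number of induced
    # rules" the paper uses to measure syntactic complexity.
    print(f"{name:22s} AUC={auc:.3f} leaves={tree.get_n_leaves()}")
```

Note that resampling is applied only to the training split, so the held-out evaluation reflects the original class distribution; the leaf count is a stand-in for the rule-count measure reported in the paper, since each root-to-leaf path of a decision tree corresponds to one rule.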