Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach

  • Authors:
  • Hongyu Guo; Herna L. Viktor

  • Affiliations:
  • University of Ottawa, Ottawa, Ontario, Canada (both authors)

  • Venue:
  • ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
  • Year:
  • 2004

Abstract

Learning from imbalanced data sets, where the number of examples of one (majority) class is much higher than that of the others, presents an important challenge to the machine learning community. Traditional machine learning algorithms may be biased towards the majority class and thus produce poor predictive accuracy on the minority class. In this paper, we describe a new approach that combines boosting, an ensemble-based learning algorithm, with data generation to improve the predictive power of classifiers on imbalanced data sets consisting of two classes. In the DataBoost-IM method, hard examples from both the majority and minority classes are identified during the execution of the boosting algorithm. These hard examples are then used to separately generate synthetic examples for the majority and minority classes. The synthetic data are added to the original training set, and the class distribution and the total weights of the different classes in the new training set are rebalanced. The DataBoost-IM method was evaluated, in terms of the F-measure, G-mean and overall accuracy, on seventeen highly and moderately imbalanced data sets, using decision trees as base classifiers. Our results are promising: the DataBoost-IM method compares favorably with a base classifier, a standard benchmark boosting algorithm and three advanced boosting-based algorithms for imbalanced data sets. The results indicate that our approach does not sacrifice one class in favor of the other, but achieves high predictive accuracy on both the minority and majority classes.
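
Since the abstract outlines the main steps of the method (identify hard examples during boosting, generate synthetic examples separately per class, rebalance the augmented training set), the following is a minimal Python sketch of such a loop. It is an illustration under stated assumptions, not the authors' implementation: the decision-tree base learner comes from scikit-learn, and the hard-example selection rule, the Gaussian-jitter generator, the `hard_fraction` parameter, and the equal-total-weight rebalancing are all assumptions made to produce a runnable example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def databoost_im_sketch(X, y, n_rounds=10, hard_fraction=0.1, seed=None):
    """Sketch of a DataBoost-IM-style loop for two classes (0 = majority, 1 = minority).
    The hard-example rule, Gaussian jitter, and rebalancing below are illustrative
    assumptions, not the procedure from the paper."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                       # AdaBoost-style example weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        clf = DecisionTreeClassifier(max_depth=3).fit(X, y, sample_weight=w)
        miss = clf.predict(X) != y
        err = w[miss].sum()
        if err == 0 or err >= 0.5:                # standard boosting stopping rule
            break
        alpha = 0.5 * np.log((1 - err) / err)

        # Assumed hard-example rule: highest-weight misclassified instances.
        k = max(1, int(hard_fraction * n))
        hard_idx = np.argsort(w * miss)[-k:]
        hard_idx = hard_idx[miss[hard_idx]]       # keep only truly misclassified seeds
        X_syn, y_syn = [], []
        for cls in (0, 1):
            for s in X[hard_idx][y[hard_idx] == cls]:
                # Assumed generator: jitter each hard seed with small Gaussian noise.
                X_syn.append(s + rng.normal(0.0, 0.05, size=s.shape))
                y_syn.append(cls)

        # Extend the training set and the weight vector, then boost the weights.
        w = w * np.exp(alpha * miss)
        if X_syn:
            X = np.vstack([X, X_syn])
            y = np.concatenate([y, y_syn])
            w = np.concatenate([w, np.full(len(y_syn), w.mean())])
        # Assumed rebalancing: give each class equal total weight.
        for cls in (0, 1):
            mask = y == cls
            w[mask] *= 0.5 / w[mask].sum()
        learners.append(clf)
        alphas.append(alpha)
        n = len(y)
    return learners, alphas
```

Prediction would follow the usual weighted vote over `learners` and `alphas`; the paper's exact generation and weighting of synthetic examples differs in detail and should be taken from the original text.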