Knowledge discovery from imbalanced and noisy data

Authors:
Jason Van Hulse;Taghi Khoshgoftaar
Affiliations:
Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431, United States;Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431, United States
Venue:
Data & Knowledge Engineering
Year:
2009

Abstract

Class imbalance and labeling errors present significant challenges to data mining and knowledge discovery applications. Some previous work has discussed these important topics; however, the relationship between the two issues has not received enough attention. Further, much of the previous work in this domain is fragmented and contradictory, raising serious questions about the reliability and validity of its empirical conclusions. In response, we present a comprehensive suite of experiments carefully designed to provide conclusive, reliable, and significant results on the problem of learning from noisy and imbalanced data. Noise is shown to significantly impact all of the learners considered in this work, and a particularly important factor is the class in which the noise is located (which, as discussed throughout this work, has very important implications for noise handling). The impact of noise, however, varies dramatically depending on the learning algorithm, and simple algorithms such as naive Bayes and nearest neighbor learners are often more robust than more complex learners such as support vector machines or random forests. Sampling techniques, which are often used to alleviate the adverse impacts of imbalanced data, are shown to improve the performance of learners built from noisy and imbalanced data. In particular, simple sampling techniques such as random undersampling are generally the most effective.
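
To make the two ideas in the abstract concrete, the following is a minimal Python sketch of class-specific label noise and random undersampling. It is an illustration only and not the authors' experimental procedure: the function names, the noise rate of 0.2, the 10/90 class split, and the 1:1 target ratio are all assumptions chosen for the example.

```python
import random
from collections import Counter


def inject_class_noise(labels, source_class, target_class, rate, rng):
    """Relabel a fraction `rate` of `source_class` examples as `target_class`.

    Hypothetical helper: it simulates labeling errors located in one class.
    """
    idx = [i for i, y in enumerate(labels) if y == source_class]
    n_flip = int(round(rate * len(idx)))
    flipped = set(rng.sample(idx, n_flip))
    return [target_class if i in flipped else y for i, y in enumerate(labels)]


def random_undersample(rows, labels, majority_class, ratio, rng):
    """Randomly discard majority-class examples.

    Keeps at most `ratio` majority examples per non-majority example and
    keeps every non-majority example. Hypothetical helper.
    """
    maj = [i for i, y in enumerate(labels) if y == majority_class]
    rest = [i for i, y in enumerate(labels) if y != majority_class]
    n_keep = min(len(maj), int(ratio * len(rest)))
    keep = sorted(rng.sample(maj, n_keep) + rest)
    return [rows[i] for i in keep], [labels[i] for i in keep]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
labels = [1] * 10 + [0] * 90      # assumed toy data: 10 minority, 90 majority
rows = list(range(100))           # stand-in feature rows

# Noise located in the minority class: 20% of positives become negatives.
noisy = inject_class_noise(labels, source_class=1, target_class=0,
                           rate=0.2, rng=rng)

# Undersample the majority class down to a 1:1 ratio.
X_bal, y_bal = random_undersample(rows, noisy, majority_class=0,
                                  ratio=1.0, rng=rng)

print(Counter(noisy))
print(Counter(y_bal))
```

In this toy setting, swapping `source_class` and `target_class` moves the noise into the majority class, which is the kind of contrast the abstract refers to when it says the class in which the noise is located matters.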