Class imbalance and labeling errors present significant challenges to data mining and knowledge discovery applications. Although prior work has examined both topics, the relationship between the two has received little attention, and much of the existing research is fragmented and contradictory, raising serious questions about the reliability and validity of its empirical conclusions. In response, we present a comprehensive suite of experiments carefully designed to provide conclusive, reliable, and significant results on the problem of learning from noisy and imbalanced data. Noise is shown to significantly degrade every learner considered in this work, and a particularly important factor is the class in which the noise is located, which, as discussed throughout, has important implications for noise handling. The impact of noise, however, varies dramatically across learning algorithms: simple learners such as naive Bayes and nearest neighbor classifiers are often more robust than complex ones such as support vector machines and random forests. Sampling techniques, which are commonly used to alleviate the adverse effects of imbalanced data, are shown to improve the performance of learners built from noisy and imbalanced data; in particular, simple techniques such as random undersampling are generally the most effective.
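The two phenomena the abstract pairs together can be sketched in a few lines. The snippet below is an illustrative toy example, not the experimental setup used in the study: the dataset, class names, and noise rate are hypothetical. It shows why the class in which label noise lands matters (corrupting the scarce minority class destroys proportionally far more of its signal) and what random undersampling does to rebalance a training set.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical imbalanced dataset: 950 majority ("neg") and 50 minority ("pos") labels.
labels = ["neg"] * 950 + ["pos"] * 50

def inject_class_noise(labels, target_class, rate):
    """Flip a fraction `rate` of the labels in `target_class` to the other class.
    Illustrates that the *location* of noise matters: the same noise rate applied
    to the minority class removes a much larger share of its scarce examples."""
    other = {"neg": "pos", "pos": "neg"}
    return [other[y] if y == target_class and random.random() < rate else y
            for y in labels]

def random_undersample(labels):
    """Randomly discard majority-class instances until both classes are equal.
    Returns the kept indices so a feature matrix could be subset the same way."""
    counts = Counter(labels)
    minority, majority = sorted(counts, key=counts.get)
    maj_idx = [i for i, y in enumerate(labels) if y == majority]
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    keep = min_idx + random.sample(maj_idx, len(min_idx))
    return sorted(keep)

# 10% class noise in the minority class corrupts ~5 of only 50 positives,
# while the same rate in the majority class would leave it largely intact.
noisy = inject_class_noise(labels, "pos", 0.10)
print(Counter(noisy))

# Random undersampling rebalances the classes before training.
kept = random_undersample(labels)
balanced = [labels[i] for i in kept]
print(Counter(balanced))
```

In a real pipeline the kept indices would be used to subset the feature matrix alongside the labels; here the point is only that undersampling yields an exactly balanced 50/50 training set at the cost of discarding majority-class data.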