Identifying and eliminating mislabeled training instances

Authors:
Carla E. Brodley; Mark A. Friedl
Affiliations:
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN; Department of Geography and Center for Remote Sensing, Boston University, Boston, MA
Venue:
AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 1
Year:
1996

Abstract

This paper presents a new approach to identifying and eliminating mislabeled training instances. The goal of the technique is to improve the classification accuracy of learning algorithms by improving the quality of the training data. The approach employs an ensemble of classifiers that serves as a filter for the training data. Using n-fold cross-validation, the training data are passed through the filter, and only instances that the filter classifies correctly are passed to the final learning algorithm. We present an empirical evaluation of the approach for the task of automated land cover mapping from remotely sensed data. Labeling error arises in these data from a multitude of sources, including lack of consistency in the vegetation classification used, variable measurement techniques, and variation in the spatial sampling resolution. Our evaluation shows that for noise levels of less than 40%, filtering results in higher predictive accuracy than not filtering, and for levels of class noise less than or equal to 20%, filtering allows the baseline accuracy to be retained. Our empirical results suggest that the ensemble filter approach is an effective method for identifying labeling errors and, further, that the approach will significantly benefit ongoing research to develop accurate and robust remote sensing-based methods to map land cover at global scales.
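
The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The three base learners (1-nearest-neighbor, 3-nearest-neighbor, nearest centroid), the round-robin fold assignment, the `min_votes` parameter, and the toy data are all assumptions made for illustration. The abstract does not say how the ensemble's votes are combined, so the sketch defaults to a simple majority.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def one_nn(train):
    """Hypothetical base learner: 1-nearest-neighbor."""
    return lambda x: min(train, key=lambda ex: sq_dist(ex[0], x))[1]


def three_nn(train):
    """Hypothetical base learner: majority label among the 3 nearest neighbors."""
    def predict(x):
        nearest = sorted(train, key=lambda ex: sq_dist(ex[0], x))[:3]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return predict


def nearest_centroid(train):
    """Hypothetical base learner: label of the closest class mean."""
    groups = {}
    for x, label in train:
        groups.setdefault(label, []).append(x)
    centroids = {
        label: tuple(sum(col) / len(xs) for col in zip(*xs))
        for label, xs in groups.items()
    }
    return lambda x: min(centroids, key=lambda label: sq_dist(centroids[label], x))


def ensemble_filter(data, learners, n_folds=5, min_votes=None):
    """Return the instances that enough held-out classifiers label correctly.

    For each fold, every learner is trained on the other folds and then
    votes on the held-out instances. min_votes defaults to a simple
    majority (an assumption; the abstract does not specify the rule).
    """
    if min_votes is None:
        min_votes = len(learners) // 2 + 1
    keep = [False] * len(data)
    for fold in range(n_folds):
        train = [ex for i, ex in enumerate(data) if i % n_folds != fold]
        models = [fit(train) for fit in learners]
        for i, (x, label) in enumerate(data):
            if i % n_folds == fold:
                votes = sum(model(x) == label for model in models)
                keep[i] = votes >= min_votes
    return [ex for ex, flag in zip(data, keep) if flag]


LEARNERS = [one_nn, three_nn, nearest_centroid]

# Toy data: two well-separated clusters plus one instance near the "a"
# cluster that carries the wrong label "b".
DATA = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((1.0, 1.0), "a"), ((0.5, 0.5), "a"),
    ((10.0, 10.0), "b"), ((10.0, 11.0), "b"), ((11.0, 10.0), "b"),
    ((11.0, 11.0), "b"), ((10.5, 10.5), "b"),
    ((0.2, 0.8), "b"),  # mislabeled instance
]

if __name__ == "__main__":
    filtered = ensemble_filter(DATA, LEARNERS, n_folds=5)
    print(len(DATA), "instances before filtering,", len(filtered), "after")
```

On this toy data the majority vote drops only the mislabeled instance. Requiring all three learners to agree (`min_votes=3`) also discards two correctly labeled neighbors of the noisy point, because the 1-nearest-neighbor learner is misled by it. This shows how the choice of voting rule trades missed errors against discarded good data.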