An algorithm for correcting mislabeled data

Authors:
Xinchuan Zeng;Tony R. Martinez
Affiliations:
Computer Science Department, Brigham Young University, Provo, UT 84602, USA. E-mail: {zengx,martinez}@cs.byu.edu;Computer Science Department, Brigham Young University, Provo, UT 84602, USA. E-mail: {zengx,martinez}@cs.byu.edu
Venue:
Intelligent Data Analysis
Year:
2001

Citing 10
Cited 3

Multilayer feedforward networks are universal approximators

Neural Networks
Instance-Based Learning Algorithms

Machine Learning
C4.5: programs for machine learning

C4.5: programs for machine learning
Reduction Techniques for Instance-BasedLearning Algorithms

Machine Learning
Instance Pruning Techniques

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Correcting Noisy Data

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Noise Elimination in Inductive Concept Learning: A Case Study in Medical Diagnosois

ALT '96 Proceedings of the 7th International Workshop on Algorithmic Learning Theory
A study of cross-validation and bootstrap for accuracy estimation and model selection

IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2
Evaluating noise correction

PRICAI'00 Proceedings of the 6th Pacific Rim international conference on Artificial intelligence
Identifying and eliminating mislabeled training instances

AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 1

Learning from imperfect data

Applied Soft Computing
Class Noise Mitigation Through Instance Weighting

ECML '07 Proceedings of the 18th European conference on Machine Learning
Improving Text Classification Accuracy by Training Label Cleaning

ACM Transactions on Information Systems (TOIS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Reliable evaluation for the performance of classifiers depends on the quality of the data sets on which they are tested. During the collecting and recording of a data set, however, some noise may be introduced into the data, especially in various real-world environments, which can degrade the quality of the data set. In this paper, we present a novel approach, called ADE (automatic data enhancement), to correct mislabeled data in a data set. In addition to using multi-layer neural networks trained by backpropagation as the basic framework, ADE assigns each training pattern a class probability vector as its class label, in which each component represents the probability of the corresponding class. During training, ADE constantly updates the probability vector based on its difference from the output of the network. With this updating rule, the probability of a mislabeled class gradually becomes smaller while that of the correct class becomes larger, which eventually causes the correction of mislabeled data after a number of training epochs. We have tested ADE on a number of data sets drawn from the UCI data repository for nearest neighbor classifiers. The results show that for most data sets, when there exists mislabeled data, a classifier constructed using a training set corrected by ADE can achieve significantly higher accuracy than that without using ADE.