Data mining and knowledge discovery aim at producing useful and reliable models from data. Unfortunately, some databases contain noisy data that hinder the generalization of the learned models. An important source of noise is mislabelled training instances. We propose a new approach that improves classification accuracy through a preliminary filtering procedure. An example is considered suspect when, in its neighbourhood defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database as a whole. Such suspect examples in the training data can then be removed or relabelled. The filtered training set is provided as input to the learning algorithm. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final classifier, show that removal gives better results than relabelling. Removal keeps the generalization error rate stable when 0 to 20% class noise is introduced, especially when the classes are well separated. Finally, the proposed filtering method is compared to a relaxation relabelling scheme.
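The suspect test described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses a plain k-nearest-neighbour graph as a stand-in for the geometrical graph, and a one-sided binomial test for the "significantly greater than the global proportion" criterion. The names `filter_suspects`, `k`, and `alpha` are illustrative assumptions.

```python
import math
import numpy as np

def binom_sf(s, n, p):
    # One-sided upper tail P(X >= s) for X ~ Binomial(n, p).
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(s, n + 1))

def filter_suspects(X, y, k=5, alpha=0.05):
    """Flag instances whose k-NN neighbourhood (a stand-in for the
    paper's geometrical graph) does NOT contain significantly more
    same-class neighbours than the global class proportion predicts."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    suspect = np.zeros(n, dtype=bool)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # exclude the point itself
        nbrs = np.argsort(d)[:k]            # k nearest neighbours
        same = int(np.sum(y[nbrs] == y[i])) # same-class neighbours
        p_global = float(np.mean(y == y[i]))
        # Suspect if the same-class count could plausibly arise by
        # chance under the global class proportion.
        suspect[i] = binom_sf(same, k, p_global) > alpha
    return suspect
```

Following the paper's conclusion, the flagged instances would simply be removed before training the final 1-NN classifier rather than relabelled.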