A study of the effect of different types of noise on the precision of supervised learning techniques

  • Authors:
  • David F. Nettleton, Albert Orriols-Puig, Albert Fornells

  • Affiliations:
  • Department of Technology, Pompeu Fabra University, Barcelona, Spain 08018 and Grup de Recerca en Sistemes Intel·ligents, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull, Barcelona ...
  • Grup de Recerca en Sistemes Intel·ligents, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull, Barcelona, Spain 08022
  • Grup de Recerca en Sistemes Intel·ligents, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull, Barcelona, Spain 08022

  • Venue:
  • Artificial Intelligence Review
  • Year:
  • 2010


Abstract

Machine learning techniques often have to deal with noisy data, which may affect the accuracy of the resulting data models. Therefore, effectively dealing with noise is a key aspect of supervised learning if reliable models are to be obtained from data. Although several authors have studied the effect of noise on particular learners, comparisons of its effect across different learners are lacking. In this paper, we address this issue by systematically comparing how different degrees of noise affect four supervised learners that belong to different paradigms. Specifically, we consider the Naïve Bayes probabilistic classifier, the C4.5 decision tree, the IBk instance-based learner and the SMO support vector machine. We have selected these four methods because they enable us to contrast different learning paradigms, and because they are considered to be among the top ten algorithms in data mining (Wu et al. 2007). We test them on a collection of data sets that are perturbed with noise in the input attributes and noise in the output class. As an initial hypothesis, we assign the techniques to two groups, NB with C4.5 and IBk with SMO, based on their expected sensitivity to noise, the first group being the less sensitive. The analysis enables us to extract key observations about the effect of different types and degrees of noise on these learning techniques. In general, we find that Naïve Bayes is the most robust algorithm and SMO the least, relative to the other two techniques. However, the underlying empirical behavior of the techniques is more complex, and varies depending on the noise type and the specific data set being processed. In general, noise in the training data set gives the learners the most difficulty.
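The experimental protocol described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual setup: it uses scikit-learn analogues of the four learners (GaussianNB for NB, DecisionTreeClassifier as a stand-in for C4.5, KNeighborsClassifier for IBk, and a linear SVC for SMO) and two hypothetical noise-injection helpers, `add_attribute_noise` and `add_class_noise`, whose exact noise models are assumptions rather than the paper's definitions.

```python
# Sketch of the noise-perturbation comparison on a sample data set.
# The learner substitutes and the noise models are assumptions; the
# paper's experiments use Weka's NB, C4.5, IBk and SMO implementations.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def add_class_noise(y, level, rng):
    """Flip a fraction `level` of labels to a different random class."""
    y = y.copy()
    n_flip = int(level * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    classes = np.unique(y)
    for i in idx:
        y[i] = rng.choice(classes[classes != y[i]])
    return y

def add_attribute_noise(X, level, rng):
    """Replace a fraction `level` of attribute values with uniform
    draws from each attribute's observed range."""
    X = X.copy()
    mask = rng.random(X.shape) < level
    lo, hi = X.min(axis=0), X.max(axis=0)
    noise = rng.uniform(lo, hi, size=X.shape)
    X[mask] = noise[mask]
    return X

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
learners = {
    "NB (GaussianNB)": GaussianNB(),
    "C4.5-like tree": DecisionTreeClassifier(random_state=0),
    "IBk-like kNN": KNeighborsClassifier(n_neighbors=3),
    "SMO-like SVM": SVC(kernel="linear"),
}
for level in (0.0, 0.1, 0.3):
    Xn = add_attribute_noise(X, level, rng)  # noise in input attributes
    yn = add_class_noise(y, level, rng)      # noise in output class
    for name, clf in learners.items():
        acc = cross_val_score(clf, Xn, yn, cv=5).mean()
        print(f"noise={level:.1f}  {name}: {acc:.3f}")
```

In such a setup, accuracy at increasing noise levels can be compared across the four learners, mirroring the paper's comparison of sensitivity to attribute noise versus class noise.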