Purifying data by machine learning with certainty levels

Authors:
Shlomi Dolev; Guy Leshem; Reuven Yagel
Affiliations:
Ben Gurion University, Israel (all authors)
Venue:
Proceedings of the Third International Workshop on Reliability, Availability, and Security
Year:
2010

Abstract

Machine learning is a fundamental paradigm for autonomic computing, self-managing systems, and decision-making under uncertainty and faults. Machine learning operates on a data set, that is, a set of data items, where each data item is a vector of feature values together with a classification. Occasionally a data set includes misleading data items that were either introduced by input device malfunctions or maliciously inserted to lead the learner to wrong conclusions. A reliable learning algorithm must be able to handle a corrupted data set; otherwise, an adversary (or simply a malfunctioning input device that corrupts a portion of the data set) may cause inaccurate classifications. The challenge, therefore, is to find effective methods to evaluate the certainty level of the learning process and to increase it as much as possible. This paper introduces the use of a certainty level measure to obtain better classification capability in the presence of corrupted data items. Assuming a known data distribution (e.g., a normal distribution) and/or a known upper bound on the number of corrupted data items, our techniques define a certainty level for classifications. A second approach enhances the random forest technique to cope with corrupted data items by augmenting the certainty level for the classification obtained in each leaf of the forest. This method is also of independent interest, since it significantly improves the classification of the random forest technique in less severe settings.
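The abstract does not state the paper's formal definitions, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a label-flipping corruption model with a known upper bound on corrupted items per leaf. The function names `worst_case_certainty` and `certainty_weighted_vote`, and the particular certainty formula, are hypothetical choices made for this example.

```python
from collections import Counter, defaultdict


def worst_case_certainty(labels, max_corrupted):
    """Return (majority_label, lower bound on its true fraction).

    Assumption (not from the paper): at most `max_corrupted` of the
    labels were flipped, so at least `count - max_corrupted` items
    truly carry the observed majority label.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    label, count = Counter(labels).most_common(1)[0]
    guaranteed = max(0, count - max_corrupted)
    return label, guaranteed / len(labels)


def certainty_weighted_vote(leaf_label_sets, max_corrupted_per_leaf):
    """Combine the leaves reached in each tree, weighting every leaf's
    majority label by its worst-case certainty instead of one vote."""
    totals = defaultdict(float)
    for labels in leaf_label_sets:
        label, certainty = worst_case_certainty(labels, max_corrupted_per_leaf)
        totals[label] += certainty
    return max(totals, key=totals.get)


# Three trees route a query to three leaves holding these training labels.
leaves = [
    ["a"] * 8 + ["b"] * 2,  # large, nearly pure leaf
    ["b"] * 3 + ["a"] * 2,  # small leaf with a narrow majority
    ["b"] * 3 + ["a"] * 2,  # small leaf with a narrow majority
]
print(certainty_weighted_vote(leaves, max_corrupted_per_leaf=2))
```

In this toy input, a plain majority vote over the three leaves would pick "b" (two leaves against one), while the certainty-weighted vote picks "a", because the two small leaves could have had their majority produced by only two flipped labels.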