Probabilistic Noise Identification and Data Cleaning

Authors:
Jeremy Kubica; Andrew Moore
Venue:
ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Year:
2003

Abstract

Real world data is never as perfect as we would like it to be and can often suffer from corruptions that may impact interpretations of the data, models created from the data, and decisions made based on the data. One approach to this problem is to identify and remove records that contain corruptions. Unfortunately, if only certain fields in a record have been corrupted then usable, uncorrupted data will be lost. In this paper we present LENS, an approach for identifying corrupted fields and using the remaining non-corrupted fields for subsequent modeling and analysis. Our approach uses the data to learn a probabilistic model containing three components: a generative model of the clean records, a generative model of the noise values, and a probabilistic model of the corruption process. We provide an algorithm for the unsupervised discovery of such models and empirically evaluate both its performance at detecting corrupted fields and, as one example application, the resulting improvement this gives to a classifier.
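
The three-component structure described in the abstract can be illustrated with a small toy sketch. This is not the LENS algorithm: the abstract does not give the model forms or the learning procedure, so everything below is an assumption made for illustration. Each field is treated independently, the clean model is a Gaussian, the noise model is uniform over a plausible value range, the corruption process is a fixed per-field probability, and all parameters are supplied by hand rather than learned without supervision. The names `corruption_posterior`, `flag_corrupted_fields`, and the field names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a Gaussian; stands in for the generative model of clean values."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def corruption_posterior(x, mean, std, low, high, p_corrupt):
    """P(field is corrupted | observed value x), by Bayes' rule.

    Assumptions (illustrative only, not taken from the paper):
      - clean values follow Normal(mean, std)
      - noise values are uniform on [low, high], and x lies in that range
      - each field is corrupted independently with probability p_corrupt
    """
    clean_likelihood = gaussian_pdf(x, mean, std)
    noise_likelihood = 1.0 / (high - low)
    corrupted = p_corrupt * noise_likelihood
    clean = (1.0 - p_corrupt) * clean_likelihood
    return corrupted / (corrupted + clean)


def flag_corrupted_fields(record, field_models, p_corrupt=0.05, threshold=0.5):
    """Return the names of fields whose corruption posterior exceeds threshold.

    The remaining fields of the record stay usable, which is the point of
    working per field instead of discarding the whole record.
    """
    flagged = []
    for name, value in record.items():
        mean, std, low, high = field_models[name]
        if corruption_posterior(value, mean, std, low, high, p_corrupt) > threshold:
            flagged.append(name)
    return flagged


# Hypothetical hand-set parameters: (mean, std, low, high) per field.
field_models = {
    "height_cm": (170.0, 10.0, 0.0, 300.0),
    "weight_kg": (70.0, 15.0, 0.0, 1000.0),
}
record = {"height_cm": 172.0, "weight_kg": 900.0}
flagged = flag_corrupted_fields(record, field_models)
print(flagged)
```

In this sketch only the implausible weight is flagged, so the height value remains available for later modeling. A fuller treatment would learn the parameters from the data and model dependencies between fields, neither of which is attempted here.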