Probabilistic Noise Identification and Data Cleaning

  • Authors:
  • Jeremy Kubica; Andrew Moore

  • Venue:
  • ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
  • Year:
  • 2003

Abstract

Real world data is never as perfect as we would like it to be and can often suffer from corruptions that may impact interpretations of the data, models created from the data, and decisions made based on the data. One approach to this problem is to identify and remove records that contain corruptions. Unfortunately, if only certain fields in a record have been corrupted then usable, uncorrupted data will be lost. In this paper we present LENS, an approach for identifying corrupted fields and using the remaining non-corrupted fields for subsequent modeling and analysis. Our approach uses the data to learn a probabilistic model containing three components: a generative model of the clean records, a generative model of the noise values, and a probabilistic model of the corruption process. We provide an algorithm for the unsupervised discovery of such models and empirically evaluate both its performance at detecting corrupted fields and, as one example application, the resulting improvement this gives to a classifier.
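
The three-component structure named in the abstract can be illustrated with a toy calculation. The sketch below is not the LENS algorithm. It assumes, purely for illustration, that each field is modeled independently, that clean values are Gaussian, that noise values are uniform over a fixed range, and that each field is corrupted with a fixed prior probability. All function names and parameter values are hypothetical, and the parameters are set by hand rather than learned from data as the abstract describes.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the assumed clean-value model (a Gaussian)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, low, high):
    """Density of the assumed noise-value model (uniform on [low, high])."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def corruption_posterior(x, clean_mean, clean_std, noise_low, noise_high, p_corrupt):
    """Posterior probability that field value x is corrupted, by Bayes' rule.

    p_corrupt plays the role of the corruption-process model: the assumed
    prior probability that any given field has been replaced by noise.
    """
    noisy = p_corrupt * uniform_pdf(x, noise_low, noise_high)
    clean = (1.0 - p_corrupt) * gaussian_pdf(x, clean_mean, clean_std)
    total = noisy + clean
    return noisy / total if total > 0.0 else 0.0


def flag_corrupted_fields(record, field_params, threshold=0.5):
    """Flag each field of a record whose corruption posterior exceeds threshold."""
    return [
        corruption_posterior(x, **params) > threshold
        for x, params in zip(record, field_params)
    ]


# Hypothetical hand-set parameters, identical for both fields of a record.
params = dict(clean_mean=5.0, clean_std=1.0,
              noise_low=0.0, noise_high=100.0, p_corrupt=0.05)
flags = flag_corrupted_fields([5.1, 95.0], [params, params])
```

Flagging individual fields, rather than discarding the whole record, leaves the unflagged fields available for later modeling, which is the motivation stated in the abstract.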