Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

  • Authors:
  • Mauricio A. Hernández; Salvatore J. Stolfo

  • Affiliations:
  • Department of Computer Science, Columbia University, New York, NY 10027; Department of Computer Science, Columbia University, New York, NY 10027

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 1998

Abstract

The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent "equational theory" that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
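
The multi-pass idea summarized in the abstract (sort on a different key in each pass, then combine the per-pass matches by transitive closure) can be illustrated with a small sketch. This is not the authors' system: the sliding-window comparison, the window size, the toy `is_match` rule standing in for an "equational theory", and all record fields and function names are assumptions made here for illustration only.

```python
def find(parent, i):
    """Return the representative of i's cluster (with path compression)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the clusters containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def is_match(r1, r2):
    # Hypothetical stand-in for a domain-dependent matching rule.
    same_household = r1["last"] == r2["last"] and r1["zip"] == r2["zip"]
    same_person_at_address = r1["first"] == r2["first"] and r1["street"] == r2["street"]
    return same_household or same_person_at_address


def single_pass(records, key, window):
    """Sort record indices by `key`, then compare each record only with
    the next `window - 1` records in sorted order (assumed scheme)."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def merge_purge(records, keys, window):
    """Run one pass per sort key and take the transitive closure of all
    matched pairs using a union-find structure."""
    parent = list(range(len(records)))
    for key in keys:
        for a, b in single_pass(records, key, window):
            union(parent, a, b)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


# Made-up records; record 2 has a typo in the first letter of the last name.
records = [
    {"first": "John", "last": "Smith", "street": "12 Oak St", "zip": "10027"},
    {"first": "Mary", "last": "Jones", "street": "5 Elm St", "zip": "10001"},
    {"first": "John", "last": "Amith", "street": "12 Oak St", "zip": "10027"},
    {"first": "Jon", "last": "Smith", "street": "12 Oak Street", "zip": "10027"},
    {"first": "Alan", "last": "Brown", "street": "9 Pine Rd", "zip": "10003"},
]

by_last = lambda r: r["last"]
by_street = lambda r: r["street"]

clusters = merge_purge(records, keys=[by_last, by_street], window=2)
print(clusters)  # [[0, 2, 3], [1], [4]]
```

In this toy data each pass alone finds only one pair: sorting by last name places records 0 and 3 next to each other, and sorting by street places records 0 and 2 next to each other. Records 2 and 3 never match directly, yet the transitive closure groups all three, which is the effect the abstract attributes to combining independent passes with small windows rather than running one pass with a large window.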