Cleaning the Spurious Links in Data

  • Authors: Mong Li Lee; Wynne Hsu; Vijay Kothari

  • Venue: IEEE Intelligent Systems
  • Year: 2004

Abstract

Data quality problems can arise from abbreviations, data entry mistakes, duplicate records, missing fields, and many other sources. Data-cleaning research has focused on duplicate elimination, also known as the merge/purge problem. A different problem is the spurious link: a real-world entity is linked to multiple records, some of which might not properly belong to it. One approach is to use context information to clean up the spurious links. This approach identifies and retrieves the data containing potential spurious links, then compares the context of the retrieved records to find those with high overlap. The degree of overlapping context indicates the likelihood of spurious links. Experiments on three real-world data sets demonstrate that this approach can correctly identify spurious links and thus assist data cleaning.
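
The context comparison described in the abstract might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' method: the abstract does not name a similarity measure, so Jaccard overlap of context sets is swapped in here, and the record layout, the function names `jaccard` and `context_overlap_scores`, and the toy data are all hypothetical. The sketch also assumes that a record sharing no context with any other record linked to the same entity is the one worth reviewing; the abstract leaves that interpretation open.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two context sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def context_overlap_scores(links):
    """For each record linked to one entity, return its best overlap
    with any other record linked to that entity.

    `links` maps a record id to its context set (hypothetical layout).
    """
    best = {rid: 0.0 for rid in links}
    for r1, r2 in combinations(links, 2):
        score = jaccard(links[r1], links[r2])
        best[r1] = max(best[r1], score)
        best[r2] = max(best[r2], score)
    return best


# Toy data, invented for illustration: papers linked to one author name,
# with co-author names as the context.
links = {
    "p1": {"alice", "bob"},
    "p2": {"alice", "bob", "carol"},
    "p3": {"dave", "erin"},
}

scores = context_overlap_scores(links)

# Assumption: zero overlap with every other linked record marks a candidate
# spurious link for manual review.
suspects = sorted(rid for rid, s in scores.items() if s == 0.0)
```

In this toy case `p1` and `p2` share two of three co-authors, while `p3` shares none with either, so only `p3` is flagged. A real pipeline would need a threshold chosen per data set rather than a strict zero.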