A New Efficient Data Cleansing Method

  • Authors:
  • Li Zhao;Sung Sam Yuan;Sun Peng;Tok Wang Ling

  • Venue:
  • DEXA '02 Proceedings of the 13th International Conference on Database and Expert Systems Applications
  • Year:
  • 2002

Abstract

One of the most important tasks in data cleansing is detecting and removing duplicate records. This task has two main components, detection and comparison. A detection method decides which records will be compared, and a comparison method determines whether two compared records are duplicates. Comparisons account for a large share of data cleansing time. We find that if a comparison method satisfies certain properties, many unnecessary, expensive comparisons can be avoided. In this paper, we first propose a new comparison method, LCSS, based on the longest common subsequence, and show that it possesses the desired properties. We then propose two new detection methods, SNM-IN and SNM-INOUT, which are variants of the popular detection method SNM. A performance study on real and synthetic databases shows that integrating SNM-IN (SNM-INOUT) with LCSS saves about 39% (56%) of comparisons.
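
To make the detection/comparison split concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract gives neither the exact LCSS formula nor the properties it relies on, and it does not describe SNM-IN or SNM-INOUT. The sketch assumes a similarity defined as the longest-common-subsequence length divided by the longer string's length, and pairs it with a plain sorted-neighborhood (SNM) window over sorted records. The function names, the window size, and the 0.8 threshold are all hypothetical.

```python
from typing import Callable, Iterator, List, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length over the longer string's length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def snm_pairs(records: List[str], key: Callable[[str], str],
              window: int) -> Iterator[Tuple[str, str]]:
    """Detection step (plain SNM): sort by key, then pair each record
    with the next window - 1 records in sorted order."""
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            yield rec, other


def find_duplicates(records: List[str],
                    key: Callable[[str], str] = lambda r: r,
                    window: int = 3,
                    threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Comparison step: keep candidate pairs whose similarity reaches the threshold."""
    return [(a, b) for a, b in snm_pairs(records, key, window)
            if lcs_similarity(a, b) >= threshold]


names = ["john smith", "mary jones", "jon smith"]
print(find_duplicates(names, window=2))
```

In this sketch every pair inside the window is compared. The savings reported in the abstract come from skipping some of those comparisons, which this plain SNM baseline does not attempt.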