A New Efficient Data Cleansing Method

Authors:
Li Zhao;Sung Sam Yuan;Sun Peng;Tok Wang Ling
Venue:
DEXA '02 Proceedings of the 13th International Conference on Database and Expert Systems Applications
Year:
2002

References (6):
The merge/purge problem for large databases. SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data.
IntelliClean: a knowledge-based intelligent data cleaner. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining.
Database research: achievements and opportunities into the 21st century. ACM SIGMOD Record.
Declarative Data Cleaning: Language, Model, and Algorithms. Proceedings of the 27th International Conference on Very Large Data Bases.
Potter's Wheel: An Interactive Data Cleaning System. Proceedings of the 27th International Conference on Very Large Data Bases.
Cleansing Data for Mining and Warehousing. DEXA '99 Proceedings of the 10th International Conference on Database and Expert Systems Applications.

Cited by (1):
On memory and I/O efficient duplication detection for multiple self-clean data sources. DASFAA'10 Proceedings of the 15th international conference on Database systems for advanced applications.

Abstract

One of the most important tasks in data cleansing is detecting and removing duplicate records. This task has two main components, detection and comparison: a detection method decides which records will be compared, and a comparison method determines whether two compared records are duplicates. Comparisons account for a large share of data cleansing time. We find that if a comparison method satisfies certain properties, many unnecessary and expensive comparisons can be avoided. In this paper, we first propose a new comparison method, LCSS, based on the longest common subsequence, and show that it possesses the desired properties. We then propose two new detection methods, SNM-IN and SNM-INOUT, which are variants of the popular detection method SNM. A performance study on real and synthetic databases shows that integrating SNM-IN with LCSS saves about 39% of comparisons, and integrating SNM-INOUT with LCSS saves about 56%.
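
To make the split between detection and comparison concrete, the following is a minimal Python sketch, not the paper's implementation. It pairs a plain sorted-neighborhood pass (SNM, the Sorted Neighborhood Method) with a similarity score built on the longest common subsequence. The abstract does not define LCSS precisely, so the normalization in `lcs_similarity` (LCS length divided by the longer string's length) is an assumption, as are the function names, the `window` size, and the `threshold`. The pruning rules that distinguish SNM-IN and SNM-INOUT are not described in the abstract and are not reproduced here.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]. ASSUMED normalization: LCS length / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


def sorted_neighborhood(records, key, window=3, threshold=0.8):
    """Basic SNM: sort by key, then compare each record only with the
    (window - 1) records that precede it in sorted order.

    Returns (duplicate index pairs, number of comparisons performed).
    `window` and `threshold` are illustrative values, not from the paper.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    duplicates = []
    comparisons = 0
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            comparisons += 1
            if lcs_similarity(records[i], records[j]) >= threshold:
                duplicates.append((min(i, j), max(i, j)))
    return duplicates, comparisons


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Mary Jones", "Mary Jonnes"]
    pairs, count = sorted_neighborhood(names, key=str.lower, window=2)
    print(pairs, count)
```

The comparison counter is included because the abstract measures its savings in comparisons avoided; a detection method that skips pairs whose outcome is already implied would lower that count while returning the same duplicate pairs.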