An Incremental Clustering Scheme for Duplicate Detection in Large Databases

Authors:
Eugenio Cesario;Francesco Folino;Giuseppe Manco;Luigi Pontieri
Affiliations:
ICAR-CNR;ICAR-CNR;ICAR-CNR;ICAR-CNR
Venue:
IDEAS '05 Proceedings of the 9th International Database Engineering & Application Symposium
Year:
2005

Citing 0
Cited 1

An incremental clustering scheme for data de-duplication

Data Mining and Knowledge Discovery

Abstract

We propose an incremental algorithm for clustering duplicate tuples in large databases. It assigns each new tuple t to the cluster containing the database tuples most similar to t, which are therefore likely to refer to the same real-world entity as t. The core of the approach is a hash-based indexing technique that tends to assign highly similar objects to the same buckets. Empirical evaluation shows that the proposed method yields a considerable efficiency improvement over a state-of-the-art index structure for proximity searches in metric spaces.
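
The general idea of the abstract (hash similar tuples into shared buckets, then assign each incoming tuple to the cluster of its most similar candidate) can be illustrated with a minimal sketch. The abstract does not specify the hash functions or the similarity measure, so everything below is an assumption made for illustration: MinHash over character 3-grams stands in for the paper's hashing scheme, Jaccard similarity stands in for its tuple similarity, and the names `qgrams`, `minhash_signature`, `IncrementalDeduplicator`, `band_size`, and `threshold` are hypothetical.

```python
import hashlib


def qgrams(text, q=3):
    """Lower-cased character q-grams of a tuple rendered as a string (assumed tokenization)."""
    pad = "#" * (q - 1)
    padded = pad + text.lower() + pad
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_hashes=16):
    """MinHash signature: similar token sets tend to agree on many positions."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


class IncrementalDeduplicator:
    """Illustrative stand-in, not the authors' index: banded MinHash buckets."""

    def __init__(self, num_hashes=16, band_size=2, threshold=0.5):
        self.num_hashes = num_hashes
        self.band_size = band_size
        self.threshold = threshold
        self.buckets = {}      # (band start, band values) -> list of tuple ids
        self.tokens = []       # token set of each stored tuple
        self.cluster_of = []   # cluster id of each stored tuple
        self.next_cluster = 0

    def add(self, text):
        """Insert one tuple and return the id of the cluster it was assigned to."""
        toks = qgrams(text)
        sig = minhash_signature(toks, self.num_hashes)
        keys = [(i, sig[i:i + self.band_size])
                for i in range(0, self.num_hashes, self.band_size)]

        # Only tuples sharing at least one bucket are compared in full.
        candidates = {tid for key in keys for tid in self.buckets.get(key, ())}
        best_id, best_sim = None, 0.0
        for tid in candidates:
            sim = jaccard(toks, self.tokens[tid])
            if sim > best_sim:
                best_id, best_sim = tid, sim

        if best_id is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best_id]   # join the nearest neighbour's cluster
        else:
            cluster = self.next_cluster          # no close match: open a new cluster
            self.next_cluster += 1

        new_id = len(self.tokens)
        self.tokens.append(toks)
        self.cluster_of.append(cluster)
        for key in keys:
            self.buckets.setdefault(key, []).append(new_id)
        return cluster
```

For example, adding "John Smith, 12 Main St" and then "JOHN SMITH, 12 MAIN ST" places both in the same cluster, since they produce identical token sets, while an unrelated string opens a new cluster. Near-duplicates that differ by a few characters are found only with some probability, which depends on the assumed `num_hashes` and `band_size` settings.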