On memory and I/O efficient duplication detection for multiple self-clean data sources

Authors:
Ji Zhang; Yanfeng Shu; Hua Wang
Affiliations:
Department of Mathematics and Computing, The University of Southern Queensland, Australia; CSIRO, ICT Centre, Hobart, Australia; Department of Mathematics and Computing, The University of Southern Queensland, Australia
Venue:
DASFAA'10 Proceedings of the 15th international conference on Database systems for advanced applications
Year:
2010

Abstract

In this paper, we propose efficient algorithms for detecting duplicates across multiple data sources that are themselves duplicate-free. In developing these algorithms, we take full account of the cases that can arise from the workload of the data sources to be cleaned and the available memory. The algorithms are memory- and I/O-efficient: they reduce the number of pairwise record comparisons and minimize the total page access cost of the cleaning process. Experimental evaluation demonstrates that the proposed algorithms are efficient and outperform the Sorted Neighborhood Method (SNM) and random access methods.
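
To illustrate why duplicate-free ("self-clean") sources reduce the number of pairwise comparisons, the following is a minimal sketch and not the paper's algorithms. It assumes in-memory lists of records and uses exact string equality as a stand-in for a real record-similarity function. The function names and the toy data are hypothetical, and the sketch does not model memory limits or page access cost.

```python
from itertools import combinations
from math import comb


def cross_source_pairs(sources):
    """Yield record pairs drawn from two different sources only.

    Assumption: each source is already duplicate-free, so pairs within
    a single source never need to be compared.
    """
    for left, right in combinations(sources, 2):
        for a in left:
            for b in right:
                yield a, b


def comparison_counts(sources):
    """Return (cross_source_only, all_pairs) comparison counts."""
    sizes = [len(s) for s in sources]
    all_pairs = comb(sum(sizes), 2)
    # Pairs inside one source are the ones the self-clean premise removes.
    within = sum(comb(n, 2) for n in sizes)
    return all_pairs - within, all_pairs


# Hypothetical toy data: three small self-clean sources.
sources = [
    ["ann lee", "bob ray"],
    ["ann lee", "cy wu", "dee fox"],
    ["bob ray"],
]

# Exact match stands in for a real record-similarity function.
duplicates = [(a, b) for a, b in cross_source_pairs(sources) if a == b]
cross_only, all_pairs = comparison_counts(sources)
print(duplicates)
print(cross_only, "of", all_pairs, "pairs compared")
```

With these six records, only pairs that span two sources are compared, so the comparisons within a source are skipped entirely. The saving grows with the size of each individual source.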