On memory and I/O efficient duplication detection for multiple self-clean data sources

  • Authors:
  • Ji Zhang;Yanfeng Shu;Hua Wang

  • Affiliations:
  • Department of Mathematics and Computing, The University of Southern Queensland, Australia;CSIRO, ICT Centre, Hobart, Australia;Department of Mathematics and Computing, The University of Southern Queensland, Australia

  • Venue:
  • DASFAA'10: Proceedings of the 15th International Conference on Database Systems for Advanced Applications
  • Year:
  • 2010

Abstract

In this paper, we propose efficient algorithms for detecting duplicates across multiple data sources that are themselves duplicate-free. In developing these algorithms, we consider the various cases that arise from the workload of the data sources to be cleaned and the available memory. The algorithms are memory- and I/O-efficient: they reduce the number of pairwise record comparisons and minimize the total page access cost of the cleaning process. Experimental evaluation demonstrates that the proposed algorithms are efficient and outperform SNM (the sorted neighborhood method) and random access methods.
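
To illustrate why duplicate-free sources can reduce pairwise comparisons, the following is a minimal sketch and not the algorithms proposed in the paper. It assumes only that no two records within the same source are duplicates, so candidate pairs need to be drawn across sources only. The function names (`naive_comparisons`, `cross_source_comparisons`, `cross_source_pairs`) are hypothetical and chosen for illustration; memory limits and page access cost are not modeled here.

```python
from itertools import combinations


def naive_comparisons(source_sizes):
    """Pairs compared if all records are pooled and every pair is checked."""
    n = sum(source_sizes)
    return n * (n - 1) // 2


def cross_source_comparisons(source_sizes):
    """Pairs compared if records are only checked against other sources.

    Assumption: each source is duplicate-free, so pairs drawn from the
    same source can be skipped.
    """
    return sum(a * b for a, b in combinations(source_sizes, 2))


def cross_source_pairs(sources):
    """Yield ((source_index, record), (source_index, record)) candidate pairs."""
    for (i, src_a), (j, src_b) in combinations(enumerate(sources), 2):
        for rec_a in src_a:
            for rec_b in src_b:
                yield (i, rec_a), (j, rec_b)


if __name__ == "__main__":
    sizes = [3, 4, 5]  # hypothetical record counts for three sources
    print(naive_comparisons(sizes))
    print(cross_source_comparisons(sizes))
```

The saving equals the number of within-source pairs, so it grows as individual sources get larger relative to the total.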