Duplicate record elimination in large data files

  • Authors: Dina Bitton; David J. DeWitt

  • Affiliations: Univ. of Wisconsin-Madison, Madison (both authors)

  • Venue: ACM Transactions on Database Systems (TODS)
  • Year: 1983

Quantified Score

Hi-index 0.03

Abstract

The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this modified merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system.
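To illustrate the idea in the abstract, here is a minimal in-memory sketch of a merge sort that drops duplicates while it merges, so intermediate runs shrink as the sort proceeds. It is not the authors' implementation: the function names (`make_runs`, `merge_dedup`, `dedup_merge_sort`), the `run_size` parameter, and the use of Python lists in place of disk-resident runs are all assumptions made for illustration. Records are assumed to be hashable and totally ordered.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_runs(records: Sequence[T], run_size: int) -> List[List[T]]:
    """Cut the input into initial runs, each sorted and duplicate-free.

    Hypothetical helper: assumes records are hashable so set() can be used.
    """
    return [
        sorted(set(records[i:i + run_size]))
        for i in range(0, len(records), run_size)
    ]


def merge_dedup(a: List[T], b: List[T]) -> List[T]:
    """Merge two sorted, duplicate-free runs, keeping one copy of equal records."""
    out: List[T] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            # Equal records: emit once and advance both runs.
            out.append(a[i])
            i += 1
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def dedup_merge_sort(records: Sequence[T], run_size: int = 2) -> List[T]:
    """Two-way merge sort that eliminates duplicates at every merge pass."""
    runs = make_runs(list(records), run_size)
    while len(runs) > 1:
        merged = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                merged.append(merge_dedup(runs[k], runs[k + 1]))
            else:
                merged.append(runs[k])  # odd run carried to the next pass
        runs = merged
    return runs[0] if runs else []


def sort_then_scan(records: Sequence[T]) -> List[T]:
    """Baseline: full sort, then one sequential pass that skips repeats."""
    out: List[T] = []
    for r in sorted(records):
        if not out or out[-1] != r:
            out.append(r)
    return out
```

Both functions return the same result for the same input. The difference the abstract points to is in cost: the baseline carries every duplicate through the entire sort, while the merging variant discards duplicates as soon as two copies meet in a merge, so later passes read and write fewer records.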