Duplicate record elimination in large data files

Authors:
Dina Bitton;David J. DeWitt
Affiliations:
Univ. of Wisconsin-Madison, Madison;Univ. of Wisconsin-Madison, Madison
Venue:
ACM Transactions on Database Systems (TODS)
Year:
1983

Citing 3
Cited 56

The art of computer programming, volume 3: (2nd ed.) sorting and searching
Implementing a relational database by means of specialized hardware

ACM Transactions on Database Systems (TODS)
System R: relational approach to database management

ACM Transactions on Database Systems (TODS)

A taxonomy of parallel sorting

ACM Computing Surveys (CSUR)
Sorting Large Files on a Backend Multiprocessor

IEEE Transactions on Computers
Index scans using a finite LRU buffer: a validated I/O model

ACM Transactions on Database Systems (TODS)
An experimental analysis of the performance of fourth generation tools on PCs

Communications of the ACM
Optimization Strategies for Relational Queries

IEEE Transactions on Software Engineering
A linear-time probabilistic counting algorithm for database applications

ACM Transactions on Database Systems (TODS)
Minimal space, average linear time duplicate deletion

Communications of the ACM
Distributive join: a new algorithm for joining relations

ACM Transactions on Database Systems (TODS)
Query evaluation techniques for large databases

ACM Computing Surveys (CSUR)
Fast algorithms for universal quantification in large databases

ACM Transactions on Database Systems (TODS)
The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Fundamental techniques for order optimization

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
A fast filtering scheme for large database cleansing

Proceedings of the eleventh international conference on Information and knowledge management
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

Data Mining and Knowledge Discovery
Improving Data Quality in Practice: A Case Study in the Italian Public Administration

Distributed and Parallel Databases
Starburst Mid-Flight: As the Dust Clears

IEEE Transactions on Knowledge and Data Engineering
Computational Complexity of Sorting and Joining Relations with Duplicates

IEEE Transactions on Knowledge and Data Engineering
Volcano - An Extensible and Parallel Query Evaluation System

IEEE Transactions on Knowledge and Data Engineering
Sort vs. Hash Revisited

IEEE Transactions on Knowledge and Data Engineering
Domains and Active Domains: What This Distinction Implies for the Estimation of Projection Sizes in Relational Databases

IEEE Transactions on Knowledge and Data Engineering
Hashing Methods and Relational Algebra Operations

VLDB '84 Proceedings of the 10th International Conference on Very Large Data Bases
Computing Iceberg Queries Efficiently

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Translating Aggregate Queries into Iterative Programs

VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases
Telcordia's Database Reconciliation and Data Quality Analysis Tool

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Coalescing in Temporal Databases

VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases
Benchmarking Database Systems: A Systematic Approach

VLDB '83 Proceedings of the 9th International Conference on Very Large Data Bases
Comparing String Similarity Measures for Reducing Inconsistency in Integrating Data from Different Sources

WAIM '01 Proceedings of the Second International Conference on Advances in Web-Age Information Management
Composition can be Faster than Join

COMPSAC '97 Proceedings of the 21st International Computer Software and Applications Conference
Dynamic Similarity for Fields with NULL Values

DaWaK 2000 Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery
Fuzzy Rule-Based Framework for Medical Record Validation

IDEAL '02 Proceedings of the Third International Conference on Intelligent Data Engineering and Automated Learning
Cleansing Data for Mining and Warehousing

DEXA '99 Proceedings of the 10th International Conference on Database and Expert Systems Applications
Exploiting early sorting and early partitioning for decision support query processing

The VLDB Journal — The International Journal on Very Large Data Bases
Learning domain-independent string transformation weights for high accuracy object identification

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
A Bayesian decision model for cost optimal record matching

The VLDB Journal — The International Journal on Very Large Data Bases
Data reduction through early grouping

CASCON '94 Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research
Source integration for data warehousing

Multidimensional databases
Two supervised learning approaches for name disambiguation in author citations

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Completeness of integrated information sources

Information Systems - Special issue: Data quality in cooperative information systems
Robust Identification of Fuzzy Duplicates

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Name disambiguation in author citations using a K-way spectral clustering method

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
A hierarchical naive Bayes mixture model for name disambiguation in author citations

Proceedings of the 2005 ACM symposium on Applied computing
Knowledge Accumulation and Resolution of Data Inconsistencies during the Integration of Microbial Information Sources

IEEE Transactions on Knowledge and Data Engineering
Implementing sorting in database systems

ACM Computing Surveys (CSUR)
Leveraging aggregate constraints for deduplication

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
A Method for Estimating the Precision of Placename Matching

IEEE Transactions on Knowledge and Data Engineering
A comparative analysis of parallel disk-based Methods for enumerating implicit graphs

Proceedings of the 2007 international workshop on Parallel symbolic computation
Shooting stars in the sky: an online algorithm for skyline queries

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Canonicalization of graph database records using similarity measures

Proceedings of the 2nd international conference on Ubiquitous information management and communication
Combining Data Integration and IE Techniques to Support Partially Structured Data

NLDB '08 Proceedings of the 13th international conference on Natural Language and Information Systems: Applications of Natural Language to Information Systems
A Term-Based Driven Clustering Approach for Name Disambiguation

APWeb/WAIM '09 Proceedings of the Joint International Conferences on Advances in Data and Web Management
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
New algorithms for join and grouping operations

Computer Science - Research and Development
Main memory implementations for binary grouping

XSym'05 Proceedings of the Third international conference on Database and XML Technologies
Modern B-Tree Techniques

Foundations and Trends in Databases
An efficient approach to identify n-wMVD for eliminating data redundancy

Proceedings of the CUBE International Information Technology Conference
Memory efficient minimum substring partitioning

Proceedings of the VLDB Endowment

Abstract

This paper addresses duplicate elimination in large data files in which the same record may occur many times. It presents a comprehensive cost analysis of the duplicate elimination operation, based on a combinatorial model developed for estimating the size of the intermediate runs produced by a modified merge-sort procedure. This modified merge-sort procedure is shown to perform significantly better than the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also provide critical input to a query optimizer in a relational database system.
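
The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure or cost model. It only assumes that the "modified merge-sort" discards duplicates inside each run and during every merge step, so intermediate runs shrink as merging proceeds. In-memory Python lists stand in for runs on disk, and the function names and the run_size and fan_in parameters are hypothetical.

```python
import heapq
from itertools import groupby
from typing import List, Sequence


def sort_then_scan(records: Sequence[str]) -> List[str]:
    """Baseline: sort everything, then one sequential pass drops adjacent duplicates."""
    return [key for key, _ in groupby(sorted(records))]


def make_runs(records: Sequence[str], run_size: int) -> List[List[str]]:
    """Cut the input into sorted initial runs, removing duplicates within each run."""
    runs = []
    for start in range(0, len(records), run_size):
        chunk = records[start:start + run_size]
        runs.append(sorted(set(chunk)))
    return runs


def merge_dedup(runs: List[List[str]]) -> List[str]:
    """Merge sorted runs, emitting each distinct record only once."""
    merged: List[str] = []
    for rec in heapq.merge(*runs):
        if not merged or merged[-1] != rec:
            merged.append(rec)
    return merged


def modified_merge_sort(records: Sequence[str],
                        run_size: int = 2,
                        fan_in: int = 2) -> List[str]:
    """Merge runs in groups of fan_in (assumed >= 2) until one run remains.

    Duplicates are dropped at every merge step, so later passes read and
    write fewer records than a plain merge sort would.
    """
    runs = make_runs(records, run_size)
    while len(runs) > 1:
        runs = [merge_dedup(runs[i:i + fan_in])
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []


if __name__ == "__main__":
    data = ["b", "a", "b", "c", "a", "a", "c", "b"]
    print(modified_merge_sort(data))
    print(sort_then_scan(data))
```

Both functions return the same sorted, duplicate-free list. The difference lies in how many records the intermediate passes have to handle, which is the quantity the paper's combinatorial model estimates.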