The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this modified merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system.
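The contrast the abstract draws can be illustrated with a minimal in-memory sketch. It assumes a two-way merge and records that compare with `<`; the function names (`merge_dedup`, `dedup_merge_sort`, `sort_then_scan`) are hypothetical and not taken from the paper. The paper's procedure works on external files and its cost analysis concerns run sizes on disk, which this sketch does not model. It only shows the idea that duplicates can be dropped at every merge step, so intermediate runs shrink, rather than being carried through a full sort and removed in a final pass.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def merge_dedup(left: Sequence[T], right: Sequence[T]) -> List[T]:
    """Merge two sorted, duplicate-free runs, emitting each record once."""
    out: List[T] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            out.append(left[i])
            i += 1
        elif right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            # Equal records: keep one copy and advance both runs.
            out.append(left[i])
            i += 1
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def dedup_merge_sort(records: Sequence[T]) -> List[T]:
    """Bottom-up merge sort that discards duplicates during every merge.

    Assumption: everything fits in memory; initial runs have length 1.
    """
    runs: List[List[T]] = [[r] for r in records]
    if not runs:
        return []
    while len(runs) > 1:
        merged: List[List[T]] = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                merged.append(merge_dedup(runs[k], runs[k + 1]))
            else:
                merged.append(runs[k])  # odd run out, carried to next pass
        runs = merged
    return runs[0]


def sort_then_scan(records: Sequence[T]) -> List[T]:
    """Baseline: full sort, then one sequential pass to drop duplicates."""
    out: List[T] = []
    for r in sorted(records):
        if not out or out[-1] != r:
            out.append(r)
    return out
```

Both functions return the same sorted, duplicate-free list. The difference is where the work happens: in `dedup_merge_sort` every intermediate run is already duplicate-free, so later merge passes handle fewer records when the input contains many repeated records.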