The merge/purge problem for large databases

Authors:
Mauricio A. Hernández;Salvatore J. Stolfo
Affiliations:
Department of Computer Science, Columbia University, New York, NY;Department of Computer Science, Columbia University, New York, NY
Venue:
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Year:
1995

Abstract

Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from these databases by identifying distinct individuals who appear in several of them, typically in an inconsistent and often incorrect fashion. The problem we study here is merging data from multiple sources as efficiently as possible while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that some use to solve merge/purge, and we present experimental results demonstrating that this approach may work well in practice, but at great expense. We also present an alternative method based on clustering, together with a comparative evaluation against the sorted neighborhood method. Finally, we show a means of improving the accuracy of the results with a multi-pass approach, which succeeds by computing the transitive closure over the results of independent runs that consider alternative primary key attributes in each pass.
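
The abstract names the techniques without spelling them out, so the following is only a minimal sketch of how a multi-pass sorted neighborhood run with a transitive closure step might look. Everything in it is an assumption made for illustration rather than the authors' implementation: the record fields, the sort keys, the window size, the matching rule `is_match`, and the use of union-find to compute the closure.

```python
def sorted_neighborhood_pass(records, key, is_match, window):
    """One pass: sort by `key`, then compare only records that fall
    within a sliding window of `window` consecutive sorted positions."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare record i with the next (window - 1) records in sort order.
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Group record indices 0..n-1 into clusters with union-find, so that
    a~b and b~c place a, b, and c in the same cluster."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def multi_pass_merge_purge(records, keys, is_match, window):
    """Run one independent pass per sort key, then close over all matches."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood_pass(records, key, is_match, window)
    return transitive_closure(len(records), pairs)


# Hypothetical data and matching rule, invented for this example only.
records = [
    {"last": "Smith", "first": "John", "zip": "10027"},  # 0
    {"last": "Smyth", "first": "John", "zip": "10027"},  # 1
    {"last": "Jones", "first": "Mary", "zip": "10025"},  # 2
    {"last": "Smith", "first": "Jon", "zip": "10027"},   # 3
    {"last": "Brown", "first": "Ann", "zip": "10001"},   # 4
]


def is_match(a, b):
    # Assumed rule: same zip code and the same first or last name.
    return a["zip"] == b["zip"] and (
        a["last"] == b["last"] or a["first"] == b["first"]
    )


def by_last(r):
    return r["last"]


def by_first(r):
    return r["first"]


single = multi_pass_merge_purge(records, [by_last], is_match, window=2)
multi = multi_pass_merge_purge(records, [by_last, by_first], is_match, window=2)
print(single)  # [[0, 3], [1], [2], [4]]
print(multi)   # [[0, 1, 3], [2], [4]]
```

In this toy data, the pass sorted on last name links records 0 and 3 but never places "Smyth" next to a matching record. The pass sorted on first name links records 0 and 1, and the closure over both passes merges all three into one cluster, even though records 1 and 3 do not match each other directly under the assumed rule.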