Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

Authors:
Mauricio A. Hernández;Salvatore J. Stolfo
Affiliations:
Department of Computer Science, Columbia University, New York, NY 10027.;Department of Computer Science, Columbia University, New York, NY 10027.
Venue:
Data Mining and Knowledge Discovery
Year:
1998



Abstract

The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent "equational theory" that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining the results of individual passes, using transitive closure over the independent results, produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
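
The multi-pass idea in the abstract (sort on a different key in each pass, compare only nearby records, then take the transitive closure of all matches) can be illustrated with a small sketch. This is not the authors' system: the record fields, the toy matching rule `is_match` (standing in for an "equational theory"), the sliding window of size `window`, and the union-find closure are all assumptions chosen for illustration.

```python
# Illustrative sketch only: field names, the matching rule, and the
# window-based comparison are assumptions, not the paper's implementation.

def find(parent, i):
    """Return the cluster representative of i (with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(parent, a, b):
    """Merge the clusters containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)

def is_match(r1, r2):
    """Hypothetical toy rule standing in for a domain-dependent equational theory."""
    same_initial = r1["first"][:1].lower() == r2["first"][:1].lower()
    same_last = r1["last"].lower() == r2["last"].lower()
    same_street = r1["street"].lower() == r2["street"].lower()
    same_zip = r1["zip"] == r2["zip"]
    return same_initial and same_zip and (same_last or same_street)

def one_pass(records, key, window):
    """Sort by one key and compare each record with the window-1 records before it."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs

def merge_purge(records, keys, window=3):
    """Run one pass per key, then combine all matched pairs by transitive closure."""
    parent = list(range(len(records)))
    for key in keys:
        for a, b in one_pass(records, key, window):
            union(parent, a, b)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())

# Made-up sample data for the sketch.
RECORDS = [
    {"first": "Sal", "last": "Stolfo", "street": "500 W 120th St", "zip": "10027"},
    {"first": "Salvatore", "last": "Stolfo", "street": "500 West 120th Street", "zip": "10027"},
    {"first": "S.", "last": "Stolfoo", "street": "500 W 120th St", "zip": "10027"},
    {"first": "Mauricio", "last": "Hernandez", "street": "1 Main St", "zip": "10001"},
]
KEYS = [
    lambda r: (r["last"].lower(), r["first"].lower()),  # pass 1: sort by name
    lambda r: r["street"].lower(),                      # pass 2: sort by street
]

if __name__ == "__main__":
    print(merge_purge(RECORDS, KEYS, window=2))
```

In this sample, the name-sorted pass links only records 0 and 1, and the street-sorted pass links only records 0 and 2. Records 1 and 2 never match directly under the toy rule, so they end up in the same cluster only through the transitive closure over both passes.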