Incremental entity resolution on rules and data

Authors:
Steven Euijong Whang;Hector Garcia-Molina
Affiliations:
Computer Science Department, Stanford University, Stanford, USA 94305 and Google Inc., Mountain View, USA;Computer Science Department, Stanford University, Stanford, USA 94305
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2014

Abstract

Entity resolution (ER) identifies database records that refer to the same real-world entity. In practice, ER is not a one-time process but is continually refined as the data, schema, and application become better understood. We address the problem of keeping the ER result up to date when the ER logic or data "evolve" frequently. A naïve approach that re-runs ER from scratch may be too expensive for large datasets. This paper investigates when and how previous "materialized" ER results can instead be exploited to avoid redundant work under evolved logic and data. We introduce algorithm properties that facilitate evolution, and we propose efficient rule and data evolution techniques for three ER models: match-based clustering (records are clustered based on Boolean matching information), distance-based clustering (records are clustered based on relative distances), and pairs ER (the pairs of matching records are identified). Using real datasets, we illustrate the cost of materializations and the potential gains of evolution over the naïve approach.
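
To make the idea of data evolution concrete, the following is a minimal sketch and not the paper's algorithm. It assumes match-based clustering is realized as the transitive closure of a Boolean match rule, computed with a union-find structure. All names here (`DisjointSet`, `resolve_from_scratch`, `resolve_incremental`, `same_name_or_email`) and the sample records are hypothetical illustrations. Under that assumption, a previously computed clustering can be reused when new records arrive: only pairs involving at least one new record need to be compared, and the result is the same as re-running from scratch.

```python
from itertools import combinations


class DisjointSet:
    """Union-find over hashable records."""

    def __init__(self):
        self.parent = {}

    def add(self, x):
        self.parent.setdefault(x, x)

    def find(self, x):
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def clusters(self):
        groups = {}
        for x in self.parent:
            groups.setdefault(self.find(x), set()).add(x)
        return {frozenset(g) for g in groups.values()}


def resolve_from_scratch(records, match):
    """Naive approach: compare every pair, then take connected components."""
    ds = DisjointSet()
    for r in records:
        ds.add(r)
    comparisons = 0
    for a, b in combinations(records, 2):
        comparisons += 1
        if match(a, b):
            ds.union(a, b)
    return ds.clusters(), comparisons


def resolve_incremental(old_clusters, new_records, match):
    """Reuse a materialized clustering; assumes new records are not in it."""
    ds = DisjointSet()
    old_records = []
    for cluster in old_clusters:
        members = list(cluster)
        old_records.extend(members)
        for m in members:
            ds.add(m)
            ds.union(members[0], m)  # restore the old cluster without comparing
    for r in new_records:
        ds.add(r)
    comparisons = 0
    # Only pairs that involve at least one new record are compared.
    for n in new_records:
        for o in old_records:
            comparisons += 1
            if match(n, o):
                ds.union(n, o)
    for a, b in combinations(new_records, 2):
        comparisons += 1
        if match(a, b):
            ds.union(a, b)
    return ds.clusters(), comparisons


def same_name_or_email(a, b):
    """Hypothetical Boolean match rule over (name, email) records."""
    return a[0].lower() == b[0].lower() or a[1].lower() == b[1].lower()


old = [("Ann Lee", "ann@x.org"), ("A. Lee", "ann@x.org"),
       ("Bob Ray", "bob@x.org"), ("Cy Wu", "cy@x.org")]
new = [("Bob Ray", "b.ray@y.org"), ("Dee Fox", "dee@x.org")]

old_clusters, _ = resolve_from_scratch(old, same_name_or_email)
inc_clusters, inc_cmp = resolve_incremental(old_clusters, new, same_name_or_email)
full_clusters, full_cmp = resolve_from_scratch(old + new, same_name_or_email)
```

In this toy run the incremental pass performs fewer comparisons than the full re-run while producing the same clusters. This equivalence relies on the transitive-closure assumption; other clustering algorithms may not allow such reuse, which is presumably why the abstract speaks of algorithm properties that facilitate evolution.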