Entity resolution with evolving rules

Authors:
Steven Euijong Whang;Hector Garcia-Molina
Affiliations:
Stanford University, Stanford, CA;Stanford University, Stanford, CA
Venue:
Proceedings of the VLDB Endowment
Year:
2010


Abstract

Entity resolution (ER) identifies database records that refer to the same real-world entity. In practice, ER is not a one-time process, but is constantly improved as the data, schema and application are better understood. We address the problem of keeping the ER result up-to-date when the ER logic "evolves" frequently. A naïve approach that re-runs ER from scratch may not be tolerable for resolving large datasets. This paper investigates when and how we can instead exploit previous "materialized" ER results to save redundant work with evolved logic. We introduce algorithm properties that facilitate evolution, and we propose efficient rule evolution techniques for two clustering ER models: match-based clustering and distance-based clustering. Using real datasets, we illustrate the cost of materializations and the potential gains over the naïve approach.
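
The idea of reusing a materialized result can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes match-based clustering by transitive closure, and it assumes the evolved rule is stricter than the old one (every pair that matches under the new rule also matches under the old rule). Under those assumptions each new cluster lies inside one old cluster, so only pairs within an old cluster need to be compared again. All names here (`transitive_closure`, `evolve_stricter`, the sample records and rules) are hypothetical.

```python
from itertools import combinations


def transitive_closure(records, match):
    """Cluster records: two records share a cluster if a chain of matches links them."""
    parent = {r: r for r in records}

    def find(r):
        # Union-find lookup with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in combinations(records, 2):
        if match(a, b):
            parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return {frozenset(c) for c in clusters.values()}


def evolve_stricter(old_clusters, new_match):
    """Re-resolve inside each old cluster only.

    Valid only under the stated assumption that new_match is stricter
    than the rule that produced old_clusters.
    """
    result = set()
    for cluster in old_clusters:
        result |= transitive_closure(sorted(cluster), new_match)
    return result


# Hypothetical records: (id, name, zip code).
records = [(1, "ann", "94305"), (2, "ann", "94305"),
           (3, "ann", "10001"), (4, "bob", "10001")]


def old_rule(a, b):
    # Old logic: same name.
    return a[1] == b[1]


def new_rule(a, b):
    # Evolved, stricter logic: same name and same zip code.
    return a[1] == b[1] and a[2] == b[2]


materialized = transitive_closure(records, old_rule)  # earlier ER result
evolved = evolve_stricter(materialized, new_rule)     # reuses that result
from_scratch = transitive_closure(records, new_rule)  # naive re-run
```

In this sketch the evolved result equals the from-scratch result, while the bob record is never compared against the ann records under the new rule. When the new rule is not stricter than the old one, this shortcut does not apply as written.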