Scalable entity matching computation with materialization
Proceedings of the 20th ACM international conference on Information and knowledge management
Entity matching (EM) is the task of identifying records from different sources that refer to the same entity. EM is widely used in real-world applications such as data integration and data cleaning, but the naive approach to EM requires exhaustive pairwise comparisons. To improve the efficiency of EM, we transform EM into a top-k query problem of identifying the best k results for a given match function, and propose a new EM algorithm that uses pre-materialized lists, i.e., sorted lists of record pairs. The proposed algorithm identifies the EM results at sub-linear cost using the materialized lists. However, because this approach requires materializing sorted lists over all record pairs, it can be impractical at scale. To address this problem, we reduce the size of the materialized lists so that they store only 1% of all pairs without sacrificing EM accuracy; this reduction is inspired by the notion of skyline queries. In addition, we extend the proposed framework to collective entity matching, which exploits both record attributes and the reference relationships across records. Experimental results show that the proposed algorithms are an order of magnitude faster than state-of-the-art algorithms without compromising accuracy.
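The top-k formulation described above can be illustrated with a small sketch in the style of Fagin's Threshold Algorithm: given one descending-sorted list of candidate pairs per attribute similarity and a monotone match function, the top-k pairs can usually be found without scanning every pair. This is an illustrative sketch, not the paper's actual algorithm; the list layout, `score_fn`, and pair identifiers are assumptions made for the example.

```python
import heapq

def threshold_topk(sorted_lists, score_fn, k):
    """Threshold-Algorithm-style top-k over materialized sorted lists.

    sorted_lists: one list per attribute similarity, each holding
                  (similarity, pair_id) tuples sorted descending.
    score_fn:     monotone match function over per-attribute scores.
    Returns the k (score, pair_id) entries with the highest combined
    score, stopping early once no unseen pair can still qualify.
    """
    # Per-attribute score tables for random access by pair id
    # (stands in for looking up a pair in each materialized list).
    lookup = [dict((pid, s) for s, pid in lst) for lst in sorted_lists]
    topk = []          # min-heap of (combined_score, pair_id)
    seen = set()
    for depth in range(len(sorted_lists[0])):
        frontier = []  # scores at the current depth of each list
        for j, lst in enumerate(sorted_lists):
            s, pid = lst[depth]   # sequential access into list j
            frontier.append(s)
            if pid not in seen:
                seen.add(pid)
                # Random access: fetch this pair's score in every list.
                full = score_fn([tbl[pid] for tbl in lookup])
                if len(topk) < k:
                    heapq.heappush(topk, (full, pid))
                elif full > topk[0][0]:
                    heapq.heapreplace(topk, (full, pid))
        # Threshold: the best combined score any unseen pair could have.
        threshold = score_fn(frontier)
        if len(topk) == k and topk[0][0] >= threshold:
            break  # early termination: sub-linear in favorable cases
    return sorted(topk, reverse=True)

# Hypothetical example: three candidate pairs, two attribute similarities.
name_list  = [(0.9, "p1"), (0.7, "p2"), (0.5, "p3")]
email_list = [(0.9, "p2"), (0.8, "p1"), (0.4, "p3")]
result = threshold_topk([name_list, email_list], sum, k=2)
```

With a monotone `score_fn` such as `sum`, the run above stops after the second depth level: the current k-th score (1.6 for `p2`) already meets the threshold 0.7 + 0.8 = 1.5, so `p3` is never fully scored.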