Scalable entity matching computation with materialization

Authors:
Sanghoon Lee; Jongwuk Lee; Seung-won Hwang
Affiliations:
Pohang University of Science and Technology (POSTECH), Pohang, South Korea (all three authors)
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

References (10):

Optimal aggregation algorithms for middleware
    PODS '01 Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Learning object identification rules for information integration
    Information Systems - Data extraction, cleaning and reconciliation

Minimal probing: supporting expensive predicates for top-k queries
    Proceedings of the 2002 ACM SIGMOD international conference on Management of data

Reference reconciliation in complex information spaces
    Proceedings of the 2005 ACM SIGMOD international conference on Management of data

Optimizing top-k queries for middleware access: A unified cost-based approach
    ACM Transactions on Database Systems (TODS)

Swoosh: a generic approach to entity resolution
    The VLDB Journal — The International Journal on Very Large Data Bases

Frameworks for entity matching: A comparison
    Data & Knowledge Engineering

Evaluating entity resolution results
    Proceedings of the VLDB Endowment

Evaluation of entity resolution approaches on real-world match problems
    Proceedings of the VLDB Endowment

Entity resolution with evolving rules
    Proceedings of the VLDB Endowment

Cited by (2):

Efficient entity matching using materialized lists
    Information Sciences: an International Journal

Incremental entity resolution on rules and data
    The VLDB Journal — The International Journal on Very Large Data Bases

Abstract

Entity matching (EM) is the task of identifying records from different data sources that refer to the same real-world entity. While EM is widely used in data integration and data cleaning applications, the naive method for EM incurs quadratic cost with respect to the size of the datasets. To address this problem, this paper proposes a scalable EM algorithm that employs a pre-materialized structure. Specifically, once the structure is built, our proposed algorithm can identify the EM results with sub-linear cost. In addition, as the rules evolve, our algorithm can efficiently adapt to new rules by selectively accessing records using the materialized structure. Our evaluation results show that our proposed EM algorithm is significantly faster than the state-of-the-art method on extensive real-life datasets.
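
To make the cost contrast in the abstract concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes a single integer attribute and a hypothetical matching rule "two records match if their values differ by at most a threshold". The function names (naive_match, materialize, match_with_index) are illustrative only. The naive version compares every pair; the second version builds a sorted list once and then answers each lookup by binary search, so a changed threshold (an evolved rule, in this toy setting) reuses the same structure.

```python
from bisect import bisect_left, bisect_right


def naive_match(left, right, threshold):
    """Baseline: compare every pair, O(len(left) * len(right)) comparisons."""
    return [
        (i, j)
        for i, a in enumerate(left)
        for j, b in enumerate(right)
        if abs(a - b) <= threshold
    ]


def materialize(right):
    """Build once (assumed structure): (value, original index) sorted by value."""
    return sorted((value, j) for j, value in enumerate(right))


def match_with_index(left, index, threshold):
    """Use the sorted list: each lookup is a binary search plus the matches found."""
    keys = [value for value, _ in index]
    matches = []
    for i, a in enumerate(left):
        lo = bisect_left(keys, a - threshold)   # first value >= a - threshold
        hi = bisect_right(keys, a + threshold)  # one past last value <= a + threshold
        matches.extend((i, index[k][1]) for k in range(lo, hi))
    return matches


left = [100, 250, 400]
right = [98, 251, 260, 405, 900]
index = materialize(right)  # built once, reused for every rule below

for threshold in (5, 10):   # a changed threshold stands in for an evolved rule
    assert sorted(match_with_index(left, index, threshold)) == sorted(
        naive_match(left, right, threshold)
    )
```

In this toy setting each lookup touches only the slice of the sorted list that can satisfy the rule, which is the general idea behind selectively accessing records through a materialized structure; the actual structure and rule language used in the paper are not described in this record.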