Efficient entity matching using materialized lists
Information Sciences: an International Journal
Entity matching (EM) is the task of identifying records from different data sources that refer to the same real-world entity. While EM is widely used in data integration and data cleaning applications, the naive method incurs quadratic cost with respect to the size of the datasets. To address this problem, this paper proposes a scalable EM algorithm that employs a pre-materialized structure. Specifically, once the structure is built, the proposed algorithm can identify the EM results with sub-linear cost. In addition, as the matching rules evolve, the algorithm can efficiently adapt to new rules by selectively accessing records using the materialized structure. Our evaluation results show that the proposed EM algorithm is significantly faster than the state-of-the-art method on extensive real-life datasets.
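To make the quadratic cost of the naive method concrete, the following is a minimal sketch of all-pairs matching between two sources. It illustrates only the baseline, not the proposed algorithm. The token-level Jaccard rule, the 0.5 threshold, the function names, and the sample records are all assumptions chosen for illustration; none of them come from the paper.

```python
from itertools import product


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed, illustrative match rule)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def naive_match(source_a, source_b, threshold=0.5):
    """Compare every record of source_a with every record of source_b.

    Returns the matched id pairs and the number of comparisons made,
    which is always len(source_a) * len(source_b).
    """
    matches = []
    comparisons = 0
    for (id_a, name_a), (id_b, name_b) in product(source_a, source_b):
        comparisons += 1
        if jaccard(name_a, name_b) >= threshold:
            matches.append((id_a, id_b))
    return matches, comparisons


# Hypothetical sample records: (record id, product name)
source_a = [("a1", "Apple iPhone 12 64GB"), ("a2", "Samsung Galaxy S21")]
source_b = [
    ("b1", "iPhone 12 Apple 64GB black"),
    ("b2", "Google Pixel 5"),
    ("b3", "Galaxy S21 Samsung"),
]

matches, comparisons = naive_match(source_a, source_b)
print(matches)      # [('a1', 'b1'), ('a2', 'b3')]
print(comparisons)  # 6
```

Every pair is evaluated regardless of how unlikely a match is, so the comparison count grows with the product of the two source sizes. A pre-materialized structure is intended to avoid touching most of those pairs.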