Scalable entity matching computation with materialization
Proceedings of the 20th ACM international conference on Information and knowledge management
Entity matching (EM) is the task of identifying records from different sources that refer to the same entity. EM is widely used in real-world applications such as data integration and data cleaning, but the naive approach to EM requires exhaustive pairwise comparisons. To improve the efficiency of EM, we transform EM into a top-k query problem of identifying the best k results for a given match function, and propose a new EM algorithm that uses pre-materialized lists, i.e., sorted lists of record pairs. The proposed algorithm identifies the EM results at sub-linear cost using the materialized lists. However, because this approach requires materializing sorted lists over all record pairs, it can be impractical at scale. To address this problem, we reduce the size of the materialized lists so that they store only 1% of all pairs without sacrificing EM accuracy; this reduction is inspired by the notion of skyline queries. In addition, we extend the proposed framework to collective entity matching, which exploits both record attributes and the reference relationships across records. Experimental results show that the proposed algorithms are an order of magnitude faster than state-of-the-art algorithms without compromising accuracy.
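The top-k formulation described above can be illustrated with a small sketch in the style of Fagin's Threshold Algorithm: given one descending-sorted list of candidate pairs per attribute similarity and a monotone match function, the top-k pairs can usually be found without scanning every pair. This is an illustrative sketch, not the paper's actual algorithm; the list layout, `score_fn`, and pair identifiers are assumptions made for the example.

```python
import heapq

def threshold_topk(sorted_lists, score_fn, k):
    """Threshold-Algorithm-style top-k over materialized sorted lists.

    sorted_lists: one list per attribute similarity, each holding
                  (similarity, pair_id) tuples sorted descending.
    score_fn:     monotone match function over per-attribute scores.
    Returns the k (score, pair_id) entries with the highest combined
    score, stopping early once no unseen pair can still qualify.
    """
    # Per-attribute score tables for random access by pair id
    # (stands in for looking up a pair in each materialized list).
    lookup = [dict((pid, s) for s, pid in lst) for lst in sorted_lists]
    topk = []          # min-heap of (combined_score, pair_id)
    seen = set()
    for depth in range(len(sorted_lists[0])):
        frontier = []  # scores at the current depth of each list
        for j, lst in enumerate(sorted_lists):
            s, pid = lst[depth]   # sequential access into list j
            frontier.append(s)
            if pid not in seen:
                seen.add(pid)
                # Random access: fetch this pair's score in every list.
                full = score_fn([tbl[pid] for tbl in lookup])
                if len(topk) < k:
                    heapq.heappush(topk, (full, pid))
                elif full > topk[0][0]:
                    heapq.heapreplace(topk, (full, pid))
        # Threshold: the best combined score any unseen pair could have.
        threshold = score_fn(frontier)
        if len(topk) == k and topk[0][0] >= threshold:
            break  # early termination: sub-linear in favorable cases
    return sorted(topk, reverse=True)

# Hypothetical example: three candidate pairs, two attribute similarities.
name_list  = [(0.9, "p1"), (0.7, "p2"), (0.5, "p3")]
email_list = [(0.9, "p2"), (0.8, "p1"), (0.4, "p3")]
result = threshold_topk([name_list, email_list], sum, k=2)
```

With a monotone `score_fn` such as `sum`, the run above stops after the second depth level: the current k-th score (1.6 for `p2`) already meets the threshold 0.7 + 0.8 = 1.5, so `p3` is never fully scored.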