Optimal hashing schemes for entity matching

Authors:
Nilesh Dalvi;Vibhor Rastogi;Anirban Dasgupta;Anish Das Sarma;Tamas Sarlos
Affiliations:
Facebook, Menlo Park, CA, USA;Google, Mountain View, CA, USA;Yahoo!, Sunnyvale, CA, USA;Google, New York, NY, USA;Google, Mountain View, CA, USA
Venue:
Proceedings of the 22nd international conference on World Wide Web
Year:
2013

Abstract

We consider the problem of devising blocking schemes for entity matching. There is a large body of work on blocking techniques that support various kinds of predicates, e.g., exact matches, fuzzy string-similarity matches, and spatial matches. However, given a complex entity matching function in the form of a Boolean expression over several such predicates, combining the individual blocking techniques into an efficient blocking scheme for that function is an important and non-trivial problem that has not been studied previously. We make fundamental contributions to this problem. We consider an abstraction for modeling complex entity matching functions as well as blocking schemes, and we present several results of theoretical and practical interest. We show that, in general, computing the optimal blocking strategy is NP-hard in the size of the DNF formula describing the matching function. We also present several algorithms for computing exact optimal strategies (with exponential complexity, but often feasible in practice) as well as fast approximation algorithms. Experiments on commercially used rule-based matching systems over real datasets at Yahoo!, as well as on synthetic datasets, demonstrate that our blocking strategies can be an order of magnitude faster than the baseline methods and that our algorithms can efficiently find good blocking strategies.
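
To make the setting concrete, the following is a minimal Python sketch of blocking for a matching function given in DNF. It is an illustration under stated assumptions, not the paper's algorithm or abstraction: the toy records, the field names, the key functions (`name_keys`, `zip_keys`, `phone_keys`), and the notion of a "strategy" as a choice of which predicates to hash on per conjunct are all hypothetical. Each predicate is assumed to hold exactly when two records share at least one blocking key, so hashing on any subset of a conjunct's predicates cannot miss a pair that satisfies that conjunct.

```python
from collections import defaultdict
from itertools import combinations, product

# Hypothetical toy records; field names and values are illustrative only.
RECORDS = [
    {"id": 0, "name": "Joe's Pizza", "zip": "94025", "phone": "555-0100"},
    {"id": 1, "name": "Joes Pizza Place", "zip": "94025", "phone": "555-0199"},
    {"id": 2, "name": "Pizza Hut", "zip": "10001", "phone": "555-0100"},
    {"id": 3, "name": "Sushi Bar", "zip": "10001", "phone": "555-0142"},
]

# Assumed blocking-key functions, one per predicate.
def name_keys(r):
    return set(r["name"].lower().split())

def zip_keys(r):
    return {r["zip"]}

def phone_keys(r):
    return {r["phone"]}

def holds(keys, a, b):
    """Assumed predicate semantics: true iff the records share a key."""
    return bool(keys(a) & keys(b))

# Matching function in DNF: (name AND zip) OR (phone).
DNF = [[name_keys, zip_keys], [phone_keys]]

def matches(a, b):
    return any(all(holds(k, a, b) for k in conj) for conj in DNF)

def candidate_pairs(records, strategy):
    """Union of bucket-mates over conjuncts.

    `strategy` lists, for each conjunct, the subset of its key functions
    to hash on; a record's composite keys are the cross product of them.
    """
    pairs = set()
    for hashed in strategy:
        buckets = defaultdict(list)
        for r in records:
            for composite in product(*(sorted(k(r)) for k in hashed)):
                buckets[composite].append(r["id"])
        for ids in buckets.values():
            pairs.update(combinations(sorted(set(ids)), 2))
    return pairs

# Two strategies for the same matching function.
COARSE = [[zip_keys], [phone_keys]]            # fewer keys, more candidates
FINE = [[name_keys, zip_keys], [phone_keys]]   # more keys, fewer candidates

by_id = {r["id"]: r for r in RECORDS}
truth = {(a["id"], b["id"]) for a, b in combinations(RECORDS, 2) if matches(a, b)}
for strategy in (COARSE, FINE):
    cands = candidate_pairs(RECORDS, strategy)
    verified = {p for p in cands if matches(by_id[p[0]], by_id[p[1]])}
    print(len(cands), "candidates,", len(verified), "verified matches")
```

On this toy data both strategies recover the same matches, but the coarse one verifies an extra candidate pair while the fine one inserts more composite keys per record. Choosing among such strategies for larger formulas is the kind of trade-off the abstract refers to.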