De-duplication - identifying distinct records that refer to the same real-world entity - is a well-known challenge in data integration. Since very large datasets prohibit comparing every pair of records, blocking is used to partition the dataset into smaller blocks within which pairwise comparisons are performed, trading recall of identified duplicates for efficiency. Traditional de-duplication tasks, while challenging, typically involved a fixed schema such as Census data or medical records. However, with the presence of large, diverse sets of structured data on the web and the need to organize it effectively on content portals, de-duplication systems must scale in a new dimension to handle a large number of schemas, tasks, and datasets, while coping with ever larger problem sizes. In addition, when working in a MapReduce framework, it is important that canopy formation be implemented as a hash function, which makes the canopy design problem more challenging. We present CBLOCK, a system that addresses these challenges. CBLOCK learns hash functions automatically from attribute domains and a labeled dataset of duplicates. Subsequently, CBLOCK expresses blocking functions using a hierarchical tree structure composed of atomic hash functions. The application may guide the automated blocking process based on architectural constraints, such as by specifying a maximum size for each block (based on memory requirements), imposing disjointness of blocks (in a grid environment), or specifying a particular objective function that trades off recall for efficiency. As a post-processing step to the automatically generated blocks, CBLOCK rolls up smaller blocks to increase recall. We present experimental results on two large-scale de-duplication datasets from a commercial search engine - consisting of over 140K movies and 40K restaurants, respectively - and demonstrate the utility of CBLOCK.
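The blocking strategy the abstract describes can be illustrated with a minimal sketch: records are partitioned by an atomic hash function, oversized blocks are subdivided hierarchically by further hash functions, and undersized blocks are rolled up into a catch-all block to recover recall. The records, the two hand-picked hash functions, and the `max_size`/`min_size` thresholds below are invented purely for illustration; CBLOCK itself learns its hash functions from attribute domains and labeled duplicates rather than taking them as given.

```python
from collections import defaultdict

# Toy records; duplicates should land in the same block so that only
# within-block pairs are compared.
movies = [
    {"title": "Heat", "year": 1995},
    {"title": "Heat", "year": 1995},   # duplicate listing
    {"title": "Her", "year": 2013},
    {"title": "Alien", "year": 1979},
]

# Hypothetical atomic hash functions: each maps a record to a block key,
# so the resulting blocks are disjoint by construction.
def h_initial(rec):
    return rec["title"][0].lower()

def h_decade(rec):
    return rec["year"] // 10 * 10

def block(records, hash_fns, max_size):
    """Partition records with the first hash function; any block larger
    than max_size is recursively subdivided with the remaining hash
    functions, yielding a hierarchical tree of blocking decisions."""
    groups = defaultdict(list)
    for rec in records:
        groups[hash_fns[0](rec)].append(rec)
    blocks = {}
    for key, recs in groups.items():
        if len(recs) > max_size and len(hash_fns) > 1:
            for subkey, subrecs in block(recs, hash_fns[1:], max_size).items():
                blocks[(key, subkey)] = subrecs
        else:
            blocks[key] = recs
    return blocks

def roll_up(blocks, min_size):
    """Merge undersized blocks into a single catch-all block, trading
    a little efficiency for recall (tiny blocks may otherwise separate
    true duplicates from all candidate comparisons)."""
    merged, small = {}, []
    for key, recs in blocks.items():
        if len(recs) >= min_size:
            merged[key] = recs
        else:
            small.extend(recs)
    if small:
        merged["rollup"] = small
    return merged

blocks = block(movies, [h_initial, h_decade], max_size=2)
# Both "Heat" records and "Her" hash to 'h'; that block exceeds
# max_size=2, so it is subdivided by decade.
rolled = roll_up(blocks, min_size=2)
```

Because each blocking decision is a pure hash of the record, the same partitioning drops directly into a MapReduce shuffle: the map phase emits `(block_key, record)` pairs and the framework's grouping delivers each block to a single reducer for pairwise comparison.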