Adaptive Blocking: Learning to Scale Up Record Linkage

Authors:
Mikhail Bilenko; Beena Kamath; Raymond J. Mooney
Affiliations:
One Microsoft Way, USA; Google Inc., USA; University of Texas at Austin, USA
Venue:
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Year:
2006


Abstract

Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
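
The idea of predicate-based blocking can be illustrated with a small sketch. The abstract does not spell out the paper's learning algorithms, so everything below is an assumption made for illustration: the toy records, the three predicates (`same_zip`, `first_three_chars`, `last_token`), and the greedy gain/cost heuristic used to pick a disjunction of predicates from labeled matching pairs. It is not the authors' method, only one plausible way to show how a blocking function built from predicates can be selected from training data instead of being hand-tuned.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (id, name, zip code).
RECORDS = [
    (0, "john smith", "78701"),
    (1, "jon smith", "78701"),
    (2, "mary jones", "10001"),
    (3, "marie jones", "10002"),
    (4, "john brown", "78701"),
]

# Labeled training pairs known to refer to the same entity (smaller id first).
MATCHES = {(0, 1), (2, 3)}

# Hypothetical blocking predicates: each maps a record to a set of block keys.
def same_zip(rec):
    return {rec[2]}

def first_three_chars(rec):
    return {rec[1][:3]}

def last_token(rec):
    return {rec[1].split()[-1]}

PREDICATES = {
    "same_zip": same_zip,
    "first_three_chars": first_three_chars,
    "last_token": last_token,
}

def candidate_pairs(records, predicate):
    """Return all id pairs that share at least one block key under `predicate`."""
    index = defaultdict(list)
    for rec in records:
        for key in predicate(rec):
            index[key].append(rec[0])
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def learn_disjunction(records, predicates, matches):
    """Greedily pick predicates whose union covers the labeled matches.

    Assumed heuristic (not taken from the paper): at each step choose the
    predicate with the highest ratio of newly covered matches to
    (1 + newly added non-matching pairs).
    """
    pair_sets = {name: candidate_pairs(records, p) for name, p in predicates.items()}
    chosen, selected_pairs = [], set()
    while not matches <= selected_pairs:
        best_name, best_score = None, 0.0
        for name, pairs in pair_sets.items():
            if name in chosen:
                continue
            new_pairs = pairs - selected_pairs
            gain = len(new_pairs & matches)   # matches newly covered
            cost = len(new_pairs - matches)   # non-matches newly admitted
            score = gain / (1 + cost)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is None:  # no remaining predicate covers a new match
            break
        chosen.append(best_name)
        selected_pairs |= pair_sets[best_name]
    return chosen, selected_pairs

chosen, pairs = learn_disjunction(RECORDS, PREDICATES, MATCHES)
print(chosen, sorted(pairs))
```

On this toy data the heuristic selects only `last_token`, which yields two candidate pairs instead of the ten pairs an all-pairs comparison would examine. Only the candidate pairs would then be passed to the more expensive similarity function.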