Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data

Authors:
George Papadakis;Ekaterini Ioannou;Claudia Niederée;Themis Palpanas;Wolfgang Nejdl
Affiliations:
National Technical University of Athens & L3S Research Center, Athens, Greece;Technical University of Crete, Chania, Greece;L3S Research Center, Hanover, Germany;University of Trento, Trento, Italy;L3S Research Center & Leibniz University of Hanover, Hanover, Germany
Venue:
Proceedings of the fifth ACM international conference on Web search and data mining
Year:
2012

Citing 21
Cited 4

The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Learning domain-independent string transformation weights for high accuracy object identification

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Reference reconciliation in complex information spaces

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Principles of dataspace systems

Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
Adaptive Blocking: Learning to Scale Up Record Linkage

ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques

World Wide Web
Entity resolution with iterative blocking

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Learning blocking schemes for record linkage

AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
Robust record linkage blocking using suffix arrays

Proceedings of the 18th ACM conference on Information and knowledge management
A framework for semantic link discovery over relational data

Proceedings of the 18th ACM conference on Information and knowledge management
HARRA: fast iterative hashed record linkage for large-scale data collections

Proceedings of the 13th International Conference on Extending Database Technology
On-the-fly entity-aware query processing in the presence of linkage

Proceedings of the VLDB Endowment
Efficient entity resolution for large heterogeneous information spaces

Proceedings of the fourth ACM international conference on Web search and data mining
Large-scale collective entity matching

Proceedings of the VLDB Endowment
The missing links: discovering hidden same-as links among a billion of triples

Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services
Eliminating the redundancy in blocking-based entity resolution methods

Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
To compare or not to compare: making entity resolution more efficient

Proceedings of the International Workshop on Semantic Web Information Management
A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

IEEE Transactions on Knowledge and Data Engineering

dbTrento: the data and information management group at the University of Trento

ACM SIGMOD Record
LINDA: distributed web-of-data-scale entity matching

Proceedings of the 21st ACM international conference on Information and knowledge management
NADEEF: a commodity data cleaning system

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Large-scale linked data integration using probabilistic reasoning and crowdsourcing

The VLDB Journal — The International Journal on Very Large Data Bases

Quantified Score

Hi-index	0.00

Visualization

Abstract

A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generateddata in the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy, improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidences, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach.