Flexible and efficient distributed resolution of large entities

Authors:
András J. Molnár; András A. Benczúr; Csaba István Sidló
Affiliations:
Data Mining and Web Search Group, Informatics Laboratory, Institute for Computer Science and Control, Hungarian Academy of Sciences, Hungary (all authors)
Venue:
FoIKS'12 Proceedings of the 7th international conference on Foundations of Information and Knowledge Systems
Year:
2012

Abstract

Entity resolution (ER) is a computationally hard problem in data integration, where database records must be grouped according to the real-world entities they belong to. In practice, these entities may consist of only a few records from different data sources, containing typos or historical data. In other cases they may contain significantly more records, especially when we search for entities at a higher level of a concept hierarchy than individual records. In this paper we give a theoretical foundation for a variety of practically important match functions. We show that, under these formulations, ER with large entities can be solved efficiently with algorithms based on MapReduce, a distributed computing paradigm. Our algorithm can efficiently incorporate probabilistic and similarity-based record matching, enabling flexible match function definitions. We demonstrate the usability of our model and algorithm in a real-world insurance ER scenario, where we identify household groups of client records.
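
To make the general idea concrete, the following is a minimal single-machine sketch of map/reduce-style entity resolution with a similarity-based match function. It is not the algorithm of the paper: the toy records, the address-based blocking key, the name-similarity threshold, and the function names (`map_phase`, `reduce_phase`, `match`, `resolve`) are all assumptions made for illustration, and the final merge step uses a plain union-find rather than any distributed procedure.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy client records: (record id, name, address).
RECORDS = [
    (1, "John Smith", "12 Oak Street"),
    (2, "Jon Smith", "12 Oak Street"),
    (3, "Mary Smith", "12 Oak Street"),
    (4, "Peter Kovacs", "7 Elm Road"),
]


def map_phase(records):
    """Emit (blocking key, record) pairs; the key is the lower-cased address (an assumption)."""
    for rec in records:
        yield rec[2].lower(), rec


def match(a, b, threshold=0.5):
    """Hypothetical similarity-based match function: compare names by string similarity."""
    return SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio() >= threshold


def reduce_phase(block):
    """Compare every pair of records that share a blocking key; emit matching id pairs."""
    for a, b in combinations(block, 2):
        if match(a, b):
            yield a[0], b[0]


def resolve(records):
    """Group record ids into entities: map, shuffle by key, reduce, then merge matched pairs."""
    # "Shuffle": collect mapper output by blocking key.
    blocks = defaultdict(list)
    for key, rec in map_phase(records):
        blocks[key].append(rec)

    # Union-find over record ids to merge matched pairs transitively.
    parent = {rec[0]: rec[0] for rec in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for block in blocks.values():
        for a, b in reduce_phase(block):
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for rid in parent:
        groups[find(rid)].add(rid)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    print(resolve(RECORDS))
```

On the toy data, the three records sharing an address and a similar name end up in one group (a "household" in the sense of the insurance scenario), while the fourth record forms its own entity. Swapping `match` for a probabilistic scorer would not change the surrounding structure, which is the kind of flexibility the abstract refers to.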