Scaling up duplicate detection in graph data

Authors:
Melanie Herschel;Felix Naumann
Affiliations:
Hasso-Plattner-Institut, Potsdam, Germany;Hasso-Plattner-Institut, Potsdam, Germany
Venue:
Proceedings of the 17th ACM conference on Information and knowledge management
Year:
2008

Citing 7
Cited 3

The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Domain-independent data cleaning via analysis of entity-relationship graph

ACM Transactions on Database Systems (TODS)
LinkClus: efficient clustering via heterogeneous semantic links

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
Collective entity resolution in relational data

ACM Transactions on Knowledge Discovery from Data (TKDD)
Object identification with attribute-mediated dependences

PKDD'05 Proceedings of the 9th European conference on Principles and Practice of Knowledge Discovery in Databases

BibBase triplified

Proceedings of the 6th International Conference on Semantic Systems
Adaptive Connection Strength Models for Relationship-Based Entity Resolution

Journal of Data and Information Quality (JDIQ) - Special Issue on Entity Resolution
A supervised learning and group linking method for historical census household linkage

AusDM '11 Proceedings of the Ninth Australasian Data Mining Conference - Volume 121

Quantified Score

Hi-index	0.00

Visualization

Abstract

Duplicate detection determines different representations of real-world objects in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection quality/effectiveness. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. We scale up duplicate detection in graph data (DDG) to large amounts of data using the support of a relational database system. We first generalize the process of DDG and then present how to scale DDG in space (amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computation can be performed efficiently. Experiments on data an order of magnitude larger than data considered so far in DDG clearly show that our methods scale to large amounts of data.