Duplicate detection identifies different representations of the same real-world object in a database. Recent research has considered using relationships among object representations to improve duplicate detection. In the general case, where relationships form a graph, research has focused mainly on the quality (effectiveness) of duplicate detection. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. We scale duplicate detection in graph data (DDG) to large amounts of data with the support of a relational database system. We first generalize the DDG process and then show how to scale DDG in space (the amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computations can be performed efficiently. Experiments on data an order of magnitude larger than the data considered so far in DDG show that our methods scale to large amounts of data.
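
To make the idea of relationship-aware duplicate detection concrete, the following is a minimal sketch, not the method of the paper. It assumes a toy schema (tables `node` and `edge`), a hypothetical function `detect_duplicates`, a simple scoring rule (string similarity blended with the Jaccard overlap of related objects), and SQLite standing in for the relational database system. All of these names and choices are illustrative assumptions.

```python
import sqlite3
from difflib import SequenceMatcher


def detect_duplicates(nodes, edges, threshold=0.75, weight=0.5):
    """Illustrative sketch only. nodes: {id: label}; edges: [(id, related_id)]."""
    # Assumption: candidates and relationships live in database tables.
    # ":memory:" keeps the example self-contained; a file path would
    # keep the data on disk instead of in main memory.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE node(id INTEGER PRIMARY KEY, label TEXT)")
    db.execute("CREATE TABLE edge(src INTEGER, dst INTEGER)")
    db.executemany("INSERT INTO node VALUES (?, ?)", list(nodes.items()))
    db.executemany("INSERT INTO edge VALUES (?, ?)", list(edges))

    cluster = {n: n for n in nodes}  # each object starts as its own cluster

    def find(n):
        # Follow merge pointers to the current cluster representative.
        while cluster[n] != n:
            n = cluster[n]
        return n

    def related(n):
        # Related objects, mapped to their current representatives.
        rows = db.execute("SELECT dst FROM edge WHERE src = ?", (n,)).fetchall()
        return {find(r[0]) for r in rows}

    pairs = db.execute(
        "SELECT a.id, a.label, b.id, b.label FROM node a "
        "JOIN node b ON a.id < b.id ORDER BY a.id, b.id"
    ).fetchall()

    changed = True
    while changed:  # repeat until no further merge happens
        changed = False
        for a, label_a, b, label_b in pairs:
            if find(a) == find(b):
                continue
            attr = SequenceMatcher(None, label_a, label_b).ratio()
            ra, rb = related(a), related(b)
            union = ra | rb
            if union:
                rel = len(ra & rb) / len(union)
                score = weight * attr + (1 - weight) * rel
            else:
                score = attr  # no relationships: fall back to labels only
            if score >= threshold:
                cluster[find(b)] = find(a)
                changed = True
    db.close()
    return {n: find(n) for n in nodes}


if __name__ == "__main__":
    movies_and_actors = {1: "The Matrix", 2: "Matrix, The",
                         3: "Keanu Reeves", 4: "Keanu Reeves", 5: "Speed"}
    stars = [(1, 3), (2, 4), (5, 3)]  # (movie, actor)
    print(detect_duplicates(movies_and_actors, stars))
```

In this toy input, the two movie titles are not similar enough to merge on their labels alone. Once the two actor records have been merged, the movies share a related object, and the next pass merges them. This dependency between comparisons is what makes duplicate detection in graph data iterative.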