An automatic blocking strategy for XML duplicate detection

Authors:
Luís Leitão;Pável Calado
Affiliations:
IST/INESC-ID, Porto Salvo, Portugal;IST/INESC-ID, Porto Salvo, Portugal
Venue:
ACM SIGAPP Applied Computing Review
Year:
2013


Abstract

Duplicate detection consists of finding objects that, despite having different representations in a database, correspond to the same real-world entity. This is typically achieved by comparing all objects to each other, which can be infeasible for large datasets. Blocking strategies have been devised to reduce the number of objects to compare, at the cost of losing some duplicates. However, these strategies typically rely on user knowledge to discover a set of parameters that optimize the comparisons while minimizing the loss. Also, they do not usually optimize the comparison between each pair of objects. In this paper, we propose a blocking method that combines two optimization strategies: one to select which objects to compare and another to optimize pairwise object comparisons. In addition, we propose a machine learning approach to determine the required parameters without the need for user intervention. Experiments performed on several datasets show that we are able not only to effectively determine the optimization parameters, but also to significantly improve efficiency while maintaining an acceptable loss of recall.
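
The trade-off described in the abstract, fewer comparisons in exchange for some missed duplicates, can be illustrated with a minimal sketch of generic key-based blocking. This is not the method proposed in the paper: the sample records, the `first_letter` key, and both helper functions are hypothetical and chosen only to make the effect visible.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Every unordered pair of record ids: n * (n - 1) / 2 comparisons."""
    return set(combinations(sorted(records), 2))


def blocked_pairs(records, key):
    """Candidate pairs restricted to records that share a blocking key."""
    blocks = defaultdict(list)
    for rid in sorted(records):
        blocks[key(records[rid])].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs


def first_letter(name):
    # Hypothetical blocking key: lowercase first character of the name.
    return name[:1].lower()


# Hypothetical records; ids 3, 4 and 5 are meant to describe the same person.
records = {
    1: "Luis Leitao",
    2: "Luís Leitão",
    3: "Pavel Calado",
    4: "P. Calado",
    5: "Kalado, Pavel",
}

naive = all_pairs(records)
blocked = blocked_pairs(records, first_letter)

print(len(naive), "pairs without blocking")
print(len(blocked), "pairs with blocking:", sorted(blocked))
```

With this key, record 5 lands in a block of its own, so the pair (3, 5) is never compared. That is the kind of recall loss a blocking strategy accepts, and it shows why the choice of key and parameters matters.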