Framework for evaluating clustering algorithms in duplicate detection

Authors:
Oktie Hassanzadeh;Fei Chiang;Hyun Chul Lee;Renée J. Miller
Affiliations:
University of Toronto;University of Toronto;Thoora Inc.;University of Toronto
Venue:
Proceedings of the VLDB Endowment
Year:
2009


Abstract

The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution, also known as duplication detection or record linkage, is used as part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system, which provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general-purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by recent significant advances that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection perform extremely well in terms of both accuracy and scalability.
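
The two-stage pipeline the abstract describes (an approximate join that proposes candidate pairs, followed by an unconstrained clustering step that groups them) can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from Stringer: the join is a naive all-pairs Jaccard comparison over word tokens, the clustering step is plain connected components, and the records, the function names, and the 0.6 threshold are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def approximate_join(records, threshold):
    """Naive self-join: index pairs whose token similarity >= threshold.

    This compares every pair (quadratic cost); it only stands in for the
    scalable approximate join techniques the abstract refers to.
    """
    tokens = [set(r.lower().split()) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(tokens[i], tokens[j]) >= threshold
    ]


def cluster(n, pairs):
    """Group n records into connected components with union-find.

    "Unconstrained" here means the number of clusters is not fixed in
    advance; it falls out of the similarity graph.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical records: the first two are intended as duplicates.
records = [
    "acme corp 12 main street toronto",
    "acme corporation 12 main street toronto",
    "globex inc 99 king street ottawa",
]
pairs = approximate_join(records, threshold=0.6)
clusters = cluster(len(records), pairs)
print(clusters)
```

Connected components is only one possible choice for the clustering step. Because it merges records transitively, a single spurious pair can fuse two unrelated groups, which is the kind of quality difference an evaluation framework would need to measure.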