The presence of duplicate records is a major data quality concern in large databases. Entity resolution, also known as duplicate detection or record linkage, is used as part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system, which provides an evaluation framework for understanding what barriers remain on the way to truly scalable and general-purpose duplicate detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by recent significant advances that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection perform extremely well in terms of both accuracy and scalability.
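To make the two-stage pipeline concrete, the following is a minimal sketch, not the Stringer implementation. It assumes a Jaccard similarity over character q-grams as the approximate join predicate and connected components as the unconstrained clustering step; the abstract names neither, and the function names, the threshold of 0.5, and the sample records are all hypothetical. The join here compares all pairs naively, whereas scalable approximate joins avoid that.

```python
from itertools import combinations


def qgrams(s, q=2):
    """Return the set of lowercase q-grams of s, padded with '#' at both ends."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def approximate_join(records, threshold=0.5, q=2):
    """Naive all-pairs self-join: index pairs whose q-gram Jaccard >= threshold.

    Assumption: Jaccard on q-grams stands in for whatever similarity
    predicate the approximate join actually uses.
    """
    grams = [qgrams(r, q) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(grams[i], grams[j]) >= threshold
    ]


def cluster_by_components(n, pairs):
    """Group record indices 0..n-1 into connected components of the join graph.

    Assumption: connected components is only one simple example of an
    unconstrained clustering algorithm (no preset number of clusters).
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    clusters = {}
    for x in range(n):
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values())


# Hypothetical sample data: two pairs of near-duplicate names.
records = ["John Smith", "Jon Smith", "Mary Jones", "Mary Jonse"]
pairs = approximate_join(records, threshold=0.5)
clusters = cluster_by_components(len(records), pairs)
```

The join output acts as a similarity graph over the records, and the clustering step turns that graph into groups of potential duplicates. Swapping `cluster_by_components` for another graph clustering function, while keeping the join fixed, is the kind of comparison the evaluation framework is described as supporting.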