Algorithms for clustering data
Algorithms for clustering data
Efficient clustering of high-dimensional data sets with application to reference matching
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Data integration using similarity joins and a word-based information representation language
ACM Transactions on Information Systems (TOIS)
FOCS '02 Proceedings of the 43rd Symposium on Foundations of Computer Science
Approximate String Joins in a Database (Almost) for Free
Proceedings of the 27th International Conference on Very Large Data Bases
A Multi-scale Algorithm for the Linear Arrangement Problem
WG '02 Revised Papers from the 28th International Workshop on Graph-Theoretic Concepts in Computer Science
Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Efficient set joins on similarity predicates
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Robust Identification of Fuzzy Duplicates
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Clustering with qualitative information
Journal of Computer and System Sciences - Special issue: Learning theory 2003
Clean Answers over Dirty Databases: A Probabilistic Approach
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Adaptive Name Matching in Information Integration
IEEE Intelligent Systems
Supporting ad-hoc ranking aggregates
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Efficient exact set-similarity joins
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
A divide-and-merge methodology for clustering
ACM Transactions on Database Systems (TODS)
Adaptive Blocking: Learning to Scale Up Record Linkage
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Collective entity resolution in relational data
ACM Transactions on Knowledge Discovery from Data (TKDD)
Eliminating fuzzy duplicates in data warehouses
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Example-driven design of efficient record matching queries
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Data integration with uncertainty
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Bootstrapping pay-as-you-go data integration systems
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Probabilistic top-k and ranking-aggregate queries
ACM Transactions on Database Systems (TODS)
A survey of top-k query processing techniques in relational database systems
ACM Computing Surveys (CSUR)
A unified approach for schema matching, coreference and canonicalization
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Learnable similarity functions and their applications to clustering and record linkage
AAAI'04 Proceedings of the 19th national conference on Artifical intelligence
Data-based research at IIT Bombay
ACM SIGMOD Record
International Journal of Multimedia Data Engineering & Management
Hi-index | 0.00 |
We propose efficient techniques for processing various Top-K count queries on data with noisy duplicates. Our method differs from existing work on duplicate elimination in two significant ways: First, we dedup on the fly only the part of the data needed for the answer --- a requirement in massive and evolving sources where batch deduplication is expensive. The non-local nature of the problem of partitioning data into duplicate groups, makes it challenging to filter only those tuples forming the K largest groups. We propose a novel method of successively collapsing and pruning records which yield an order of magnitude reduction in running time compared to deduplicating the entire data first. Second, we return multiple high scoring answers to handle situations where it is impossible to resolve if two records are indeed duplicates of each other. Since finding even the highest scoring deduplication is NP-hard, the existing approach is to deploy one of many variants of score-based clustering algorithms which do not easily generalize to finding multiple groupings. We model deduplication as a segmentation of a linear embedding of records and present a polynomial time algorithm for finding the R highest scoring answers. This method closely matches the accuracy of an exact exponential time algorithm on several datasets.