Robust Identification of Fuzzy Duplicates

Authors:
Surajit Chaudhuri;Venkatesh Ganti;Rajeev Motwani
Affiliations:
Microsoft Research;Microsoft Research;Stanford University
Venue:
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Year:
2005


Abstract

Detecting and eliminating fuzzy duplicates is a critical data cleaning task required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples that represent the same real-world entity. We propose two novel criteria that characterize fuzzy duplicates more accurately than existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework achieve better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework, and we evaluate this algorithm on real datasets to demonstrate its accuracy and scalability.
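To make the notion of fuzzy duplicates concrete, the following minimal Python sketch flags pairs of tuples whose text is similar above a fixed global threshold. It is only an illustration of the problem, not the framework or the criteria proposed in the paper, which the abstract does not spell out. The function names, the sample records, the 0.8 threshold, and the choice of the standard library's difflib.SequenceMatcher as the similarity measure are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a similarity score in [0, 1] between two tuples of strings.

    Assumption: each tuple is flattened to one lowercase string and compared
    with difflib's ratio; a real system would likely use a tuned measure.
    """
    text_a = " ".join(a).lower()
    text_b = " ".join(b).lower()
    return SequenceMatcher(None, text_a, text_b).ratio()


def candidate_duplicates(records, threshold=0.8):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    This is a plain global-threshold baseline (hypothetical helper), shown
    only to illustrate what "seemingly distinct tuples" can look like.
    """
    pairs = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        if similarity(r1, r2) >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical sample data: the first two tuples plausibly describe the
# same real-world entity despite differing text.
records = [
    ("Microsoft Corp.", "Redmond", "WA"),
    ("Microsoft Corporation", "Redmond", "WA"),
    ("Stanford University", "Stanford", "CA"),
]

if __name__ == "__main__":
    print(candidate_duplicates(records))
```

A single global threshold like this compares every pair of tuples and treats all regions of the data the same way, which is one reason a more careful characterization of duplicates, as the abstract argues, may be needed.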