Detecting and eliminating fuzzy duplicates is a critical data cleaning task required by many applications. Fuzzy duplicates are multiple, seemingly distinct tuples that represent the same real-world entity. We propose two novel criteria that characterize fuzzy duplicates more accurately than existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem and show that solutions within it are more accurate than earlier approaches. We present an efficient algorithm for solving instantiations within the framework and evaluate it on real datasets to demonstrate its accuracy and scalability.
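
To make the notion of fuzzy duplicates concrete, the following is a minimal sketch of a naive baseline, not the framework proposed here: it groups strings whose pairwise similarity reaches a single global threshold. The abstract does not spell out the two proposed criteria, so none are implemented. The function names, the choice of `difflib` as the similarity measure, and the threshold value of 0.8 are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1], based on difflib's ratio.

    Assumption: any string similarity function could be used here.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_duplicate_groups(records, threshold=0.8):
    """Group records whose pairwise similarity reaches the threshold.

    Hypothetical baseline: groups are the connected components of the
    "similar enough" graph, tracked with a small union-find structure.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; quadratic, so only suitable for small inputs.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)

    # Only groups with more than one member are candidate duplicates.
    return [g for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    names = ["Microsoft Corp", "Microsoft Corp.", "Mircosoft Corp", "Oracle Inc"]
    for group in fuzzy_duplicate_groups(names):
        print(group)
```

A single global threshold like this is sensitive to its setting: too low and distinct entities merge, too high and true duplicates are missed. That sensitivity is the kind of limitation that motivates looking for more accurate characterizations of duplicates.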