Eliminating fuzzy duplicates in data warehouses

Authors:
Rohit Ananthakrishna;Surajit Chaudhuri;Venkatesh Ganti
Affiliations:
Cornell University;Microsoft Research;Microsoft Research
Venue:
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Year:
2002

Abstract

The duplicate elimination problem, that is, detecting multiple tuples that describe the same real-world entity, is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches produce large numbers of false positives when they are tuned to catch domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse; such tables are usually associated with hierarchies. We exploit these hierarchies to develop a high-quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.
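
The abstract does not spell out the algorithm, so the following is only a toy illustration of the motivation, not the authors' method. It assumes a hypothetical two-level hierarchy (country, then state) and compares a normalized edit-distance similarity with a simple Jaccard overlap of child sets. The function names, the sample data, and the choice of Jaccard overlap are all assumptions made for this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance scaled into [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def child_overlap(children_a: set, children_b: set) -> float:
    """Jaccard overlap of the child sets of two parent values (assumed measure)."""
    union = children_a | children_b
    return len(children_a & children_b) / len(union) if union else 0.0


# Hypothetical dimension data: each country value maps to the states seen under it.
children = {
    "US": {"WA", "CA", "NY"},
    "United States": {"WA", "CA", "TX"},
    "UK": {"England", "Scotland"},
}

if __name__ == "__main__":
    for a, b in [("US", "UK"), ("US", "United States")]:
        print(a, "vs", b,
              "text:", round(edit_similarity(a, b), 3),
              "children:", round(child_overlap(children[a], children[b]), 3))
```

In this toy data, the textual score ranks "US" closer to "UK" than to "United States", while the child-set overlap ranks them the other way round. That is the kind of signal a hierarchy can contribute when an abbreviation defeats string similarity.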