Dynamic constraints for record matching

Authors:
Wenfei Fan;Hong Gao;Xibei Jia;Jianzhong Li;Shuai Ma
Affiliations:
University of Edinburgh, Edinburgh, UK and Harbin Institute of Technology, Harbin, China;Harbin Institute of Technology, Harbin, China;School of Informatics, University of Edinburgh, Edinburgh, UK;Harbin Institute of Technology, Harbin, China;School of Informatics, University of Edinburgh, Edinburgh, UK
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2011

Abstract

This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, MDs are developed for record matching, and are defined in terms of similarity predicates and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. Moreover, we develop a sound and complete system for inferring MDs. (d) We provide a quadratic-time algorithm for inferring MDs and an effective algorithm for deducing a set of high-quality RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking, or windowing, and that the MD-based techniques effectively improve the quality and efficiency of various record matching methods.
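
To make the idea of a relative candidate key more concrete, the following is a minimal illustrative sketch, not the paper's algorithm or notation. It assumes a key can be modeled as a list of attribute pairs, each with a comparison predicate, so that the key states what to compare and how. The attribute names, the sample records, the `similar` predicate (built on Python's `difflib`), and its 0.8 threshold are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def equal(a, b):
    """Exact-equality predicate."""
    return a == b


def similar(a, b, threshold=0.8):
    """Hypothetical similarity predicate: case-insensitive difflib ratio.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# A hypothetical key across two relations with different schemas:
# each entry is (attribute in relation 1, attribute in relation 2, predicate).
KEY = [
    ("last_name", "surname", equal),
    ("address", "post_addr", similar),
]


def matches(rec1, rec2, key):
    """Return True if every attribute pair in the key satisfies its predicate."""
    return all(pred(rec1[a1], rec2[a2]) for a1, a2, pred in key)


# Made-up sample records for illustration only.
card = {"last_name": "Smith", "address": "10 Oak Street, Edinburgh"}
billing = {"surname": "Smith", "post_addr": "10 Oak St, Edinburgh"}
other = {"surname": "Jones", "post_addr": "10 Oak St, Edinburgh"}

same_person = matches(card, billing, KEY)
different_person = matches(card, other, KEY)
```

In this sketch the address pair is compared with a tolerant predicate while the name pair requires equality, which mirrors the abstract's point that a key specifies both which attributes to compare and how to compare them across possibly different relations.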