Interaction between record matching and data repairing

Authors:
Wenfei Fan;Jianzhong Li;Shuai Ma;Nan Tang;Wenyuan Yu
Affiliations:
University of Edinburgh, Edinburgh, United Kingdom;Harbin Institute of Technology, Harbin, China;Beihang University, Beijing, China;University of Edinburgh, Edinburgh, United Kingdom;University of Edinburgh, Edinburgh, United Kingdom
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Abstract

Record matching and data repairing are central to a data cleaning system. Matching aims to identify tuples that refer to the same real-world object, while repairing aims to make a database consistent by using constraints to fix errors in the data. Current data cleaning systems treat the two as separate processes and rely on heuristic solutions. This paper studies a new problem, namely, the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we propose a uniform framework that seamlessly unifies repairing and matching operations, to clean a database based on integrity constraints, matching rules and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP-complete or coNP-complete to PSPACE-complete. Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analysis, respectively, which are more accurate than possible fixes generated by heuristics. Using real-life data, we experimentally verify that our techniques are significantly more accurate than record matching and data repairing performed as separate processes.
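
The interaction described in the abstract can be illustrated with a toy example. The sketch below is not the paper's framework or algorithms: the rule `repair_city`, the rule `match_master`, the driver `clean`, the attribute names, and all data values are hypothetical and chosen only for illustration. It shows a repair making a match against master data possible, and the match in turn supplying a further fix.

```python
# Toy illustration only: rules, attribute names, and values are hypothetical,
# not taken from the paper.

def repair_city(t):
    """Repair rule (assumed): area code '131' implies city 'Edi'."""
    if t["area_code"] == "131" and t["city"] != "Edi":
        return {**t, "city": "Edi"}
    return t


def match_master(t, master):
    """Matching rule (assumed): a tuple that agrees with a master record on
    city and phone refers to the same person, so take the master's name."""
    for m in master:
        if t["city"] == m["city"] and t["phone"] == m["phone"]:
            if t["name"] != m["name"]:
                return {**t, "name": m["name"]}
    return t


def clean(t, master, max_rounds=10):
    """Interleave repairing and matching until nothing changes.
    max_rounds is a safety bound; these two toy rules settle quickly."""
    rules = [repair_city, lambda x: match_master(x, master)]
    for _ in range(max_rounds):
        before = t
        for rule in rules:
            t = rule(t)
        if t == before:
            break
    return t


master = [{"name": "Mark Smith", "phone": "3887644", "city": "Edi"}]
dirty = {"name": "M. Smith", "phone": "3887644",
         "area_code": "131", "city": "Ldn"}

# Matching alone fails: the erroneous city blocks the match.
matched_only = match_master(dirty, master)

# Repairing first fixes the city, which enables the match,
# and the match then corrects the name from the master record.
cleaned = clean(dirty, master)
print(matched_only["name"], "|", cleaned["name"], cleaned["city"])
```

In this sketch, running the two steps as separate one-off passes in the order "match, then repair" would correct the city but leave the name untouched, whereas interleaving them reaches both fixes.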