Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing aims to make a database consistent by fixing errors in the data using constraints. Current data cleaning systems treat these as separate processes, based on heuristic solutions. This paper studies a new problem, namely, the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we propose a uniform framework that seamlessly unifies repairing and matching operations, to clean a database based on integrity constraints, matching rules and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP-complete or coNP-complete to PSPACE-complete. Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analysis, respectively, which are more accurate than possible fixes generated by heuristics. Using real-life data, we experimentally verify that our techniques significantly improve on the accuracy of record matching and data repairing when these are taken as separate processes.
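The interaction between repairing and matching can be made concrete with a toy sketch. The Python below is not the paper's algorithm: the master record, the constraint (area code determines city), the matching rule (same name and city), and every function name are hypothetical, chosen only to show how a repair can enable a match and how that match can then enable a further repair.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]

# Hypothetical master data, assumed to be clean.
MASTER: List[Record] = [
    {"name": "Ann Lee", "area_code": "131", "city": "Edinburgh", "phone": "5550101"},
]

# Hypothetical integrity constraint: the area code determines the city.
AREA_CODE_TO_CITY = {"131": "Edinburgh", "20": "London"}


def repair(t: Record) -> bool:
    """Enforce the constraint on t in place; return True if t changed."""
    city = AREA_CODE_TO_CITY.get(t["area_code"])
    if city is not None and t["city"] != city:
        t["city"] = city
        return True
    return False


def match(t: Record) -> Optional[Record]:
    """Hypothetical matching rule: equal name and city mean the same object."""
    for m in MASTER:
        if t["name"] == m["name"] and t["city"] == m["city"]:
            return m
    return None


def clean(t: Record) -> Record:
    """Interleave repairing and matching until neither changes the tuple."""
    t = dict(t)  # work on a copy so the input stays untouched
    changed = True
    while changed:
        changed = repair(t)
        m = match(t)
        if m is not None and t["phone"] != m["phone"]:
            t["phone"] = m["phone"]  # a match lets us fix the phone from master data
            changed = True
    return t


dirty = {"name": "Ann Lee", "area_code": "131", "city": "London", "phone": "5550199"}
matched_before = match(dirty)  # the wrong city blocks the match
cleaned = clean(dirty)
```

In this sketch the dirty tuple fails the matching rule until the constraint corrects its city. Once it matches the master record, the phone number can be corrected as well, which neither step could have done in isolation.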