In many domains, data cleaning is hampered by our limited ability to specify a comprehensive set of integrity constraints that help identify erroneous data. An alternative approach to improving data quality is to exploit different data sources that contain information about the same set of objects. Such overlapping sources highlight hot-spots of poor data quality through conflicting data values and immediately provide alternative values for conflict resolution. To derive a dataset of high quality, we can merge the overlapping sources based on a quality assessment of the conflicting values. The quality of the resulting dataset, however, depends heavily on our ability to assess the quality of conflicting values effectively. The main objective of this article is to introduce methods that help the developer of an integrated system over overlapping but contradicting sources to improve the quality of data. Value conflicts between contradicting sources are often systematic, caused by some characteristic of the different sources. Our goal is to identify such systematic differences and outline data patterns that occur in conjunction with them. Evaluated by an expert user, the regularities discovered provide insights into possible reasons for conflicts and help to assess the quality of inconsistent values. The contributions of this article are two concepts of systematic conflicts: contradiction patterns and minimal update sequences. Contradiction patterns resemble a special form of association rules that summarize characteristic data properties for conflict occurrence. We adapt existing association rule mining algorithms for mining contradiction patterns. Contradiction patterns, however, view each class of conflicts in isolation, sometimes leading to largely overlapping patterns. Sequences of set-oriented update operations that transform one data source into the other are compact descriptions of all regular differences among the sources.
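The idea of a contradiction pattern, as described above, can be illustrated with a small sketch. It is not the article's algorithm: it only considers rules with a single attribute-value condition of the form "condition implies conflict in a given attribute", and the function name `mine_contradiction_patterns`, the thresholds, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def mine_contradiction_patterns(source_a, source_b, attribute,
                                min_support=2, min_confidence=0.8):
    """Find single-condition rules 'condition -> conflict in `attribute`'.

    Hypothetical helper: both sources map an object key to a dict of
    attribute values. Conditions are taken from source_a's rows.
    Returns a list of (condition, support, confidence) tuples.
    """
    condition_total = Counter()     # objects matching the condition
    condition_conflict = Counter()  # ... that also conflict on `attribute`

    # Only objects present in both sources can be in conflict.
    for key in source_a.keys() & source_b.keys():
        row_a, row_b = source_a[key], source_b[key]
        in_conflict = row_a[attribute] != row_b[attribute]
        for name, value in row_a.items():
            if name == attribute:
                continue
            condition = (name, value)
            condition_total[condition] += 1
            if in_conflict:
                condition_conflict[condition] += 1

    patterns = []
    for condition, support in condition_conflict.items():
        confidence = support / condition_total[condition]
        if support >= min_support and confidence >= min_confidence:
            patterns.append((condition, support, confidence))
    # Most frequent and most reliable patterns first.
    return sorted(patterns, key=lambda p: (-p[1], -p[2], str(p[0])))


# Toy data (invented): the two sources disagree on "dose" only for lab X.
source_a = {
    1: {"lab": "X", "unit": "mg", "dose": 10},
    2: {"lab": "X", "unit": "mg", "dose": 20},
    3: {"lab": "Y", "unit": "mg", "dose": 30},
    4: {"lab": "Y", "unit": "g", "dose": 40},
}
source_b = {
    1: {"lab": "X", "unit": "mg", "dose": 100},
    2: {"lab": "X", "unit": "mg", "dose": 200},
    3: {"lab": "Y", "unit": "mg", "dose": 30},
    4: {"lab": "Y", "unit": "g", "dose": 40},
}
patterns = mine_contradiction_patterns(source_a, source_b, "dose")
```

In this toy example the condition `lab = X` covers exactly the conflicting objects, whereas `unit = mg` also covers a non-conflicting one and falls below the confidence threshold. A full association rule miner would additionally consider conjunctions of conditions.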
We consider minimal update sequences to be the most likely explanation for observed differences between overlapping data sources. Furthermore, the order of operations within the sequences points out potential dependencies between systematic differences. Finding minimal update sequences, however, is beyond reach in practice. We show that the problem is already NP-complete for a restricted set of operations. In light of this intractability result, we present heuristics that lead to convincing results for all examples we considered.
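To make the notion of an update sequence more concrete, the following sketch uses a simple greedy strategy. It is an assumption for illustration, not one of the heuristics presented in the article: operations are restricted to the hypothetical form "set one attribute to a value for all rows where one attribute equals a given value", both tables are assumed to share the same keys and attributes, and the result is not guaranteed to be minimal.

```python
def count_differences(current, target):
    """Number of cells in which the two tables disagree."""
    return sum(current[k][a] != target[k][a]
               for k in current for a in current[k])


def apply_update(table, op):
    """Apply 'SET set_attr = new_val WHERE cond_attr = cond_val' to a copy."""
    cond_attr, cond_val, set_attr, new_val = op
    return {k: ({**row, set_attr: new_val}
                if row[cond_attr] == cond_val else dict(row))
            for k, row in table.items()}


def candidate_updates(current, target):
    """Operations that would fix at least one currently differing cell."""
    ops = set()
    for k, row in current.items():
        for set_attr, value in row.items():
            if value != target[k][set_attr]:
                for cond_attr, cond_val in row.items():
                    ops.add((cond_attr, cond_val,
                             set_attr, target[k][set_attr]))
    return ops


def greedy_update_sequence(source, target):
    """Greedily pick the update that leaves the fewest differing cells."""
    current = {k: dict(row) for k, row in source.items()}
    sequence = []
    remaining = count_differences(current, target)
    while remaining > 0:
        best_op, best_table, best_remaining = None, None, remaining
        # Sorting by repr makes tie-breaking deterministic.
        for op in sorted(candidate_updates(current, target), key=repr):
            updated = apply_update(current, op)
            diff = count_differences(updated, target)
            if diff < best_remaining:
                best_op, best_table, best_remaining = op, updated, diff
        if best_op is None:
            break  # no single operation improves the table; give up
        sequence.append(best_op)
        current, remaining = best_table, best_remaining
    return sequence, current


# Toy data (invented): five differing cells, explained by two updates.
source = {
    1: {"lab": "X", "unit": "mg", "status": "old"},
    2: {"lab": "X", "unit": "mg", "status": "old"},
    3: {"lab": "Y", "unit": "mg", "status": "old"},
}
target = {
    1: {"lab": "X", "unit": "g", "status": "new"},
    2: {"lab": "X", "unit": "g", "status": "new"},
    3: {"lab": "Y", "unit": "mg", "status": "new"},
}
sequence, final = greedy_update_sequence(source, target)
```

Here two set-oriented operations account for all five differing cells, which is the kind of compact description the abstract refers to. A greedy choice can be misled when operations interact, which is consistent with the intractability of finding a truly minimal sequence.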