Clean Answers over Dirty Databases: A Probabilistic Approach

Authors:
Periklis Andritsos;Ariel Fuxman;Renee J. Miller
Affiliations:
University of Trento;University of Toronto;University of Toronto
Venue:
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Year:
2006

Cited 66

Towards correcting input data errors probabilistically using integrity constraints

MobiDE '06 Proceedings of the 5th ACM international workshop on Data engineering for wireless and mobile access
MauveDB: supporting model-based user views in database systems

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
From complete to incomplete information and back

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Management of probabilistic data: foundations and challenges

Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
OLAP over imprecise data with domain constraints

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Materialized views in probabilistic databases: for information exchange and query optimization

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Query processing over incomplete autonomous databases

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
A three-valued semantics for querying and repairing inconsistent databases

Annals of Mathematics and Artificial Intelligence
MCDB: a monte carlo approach to managing uncertain data

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Query evaluation with soft-key constraints

Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Dependencies revisited for improving data quality

Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Probabilistic top-k and ranking-aggregate queries

ACM Transactions on Database Systems (TODS)
Probabilistic databases

ACM SIGACT News
World-set decompositions: Expressiveness and efficient algorithms

Theoretical Computer Science
Interactive source registration in community-oriented information integration

Proceedings of the VLDB Endowment
Conditioning probabilistic databases

Proceedings of the VLDB Endowment
Cleaning uncertain data with quality guarantees

Proceedings of the VLDB Endowment
Exploiting shared correlations in probabilistic databases

Proceedings of the VLDB Endowment
Systems aspects of probabilistic data management

Proceedings of the VLDB Endowment
Approximate Probabilistic Query Answering over Inconsistent Databases

ER '08 Proceedings of the 27th International Conference on Conceptual Modeling
A compositional framework for complex queries over uncertain data

Proceedings of the 12th International Conference on Database Theory
Efficient top-k count queries over imprecise duplicates

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Probabilistic databases: diamonds in the dirt

Communications of the ACM - Barbara Liskov: ACM's A.M. Turing Award Winner
Consensus answers for queries over probabilistic databases

Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Indexing correlated probabilistic databases

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Recursive random fields

IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
The trichotomy of HAVING queries on a probabilistic database

The VLDB Journal — The International Journal on Very Large Data Bases
Query processing over incomplete autonomous databases: query rewriting using learned data dependencies

The VLDB Journal — The International Journal on Very Large Data Bases
$10^{10^{6}}$ worlds and beyond: efficient representation and processing of incomplete information

The VLDB Journal — The International Journal on Very Large Data Bases
Creating probabilistic databases from duplicated data

The VLDB Journal — The International Journal on Very Large Data Bases
PrDB: managing and exploiting rich correlations in probabilistic databases

The VLDB Journal — The International Journal on Very Large Data Bases
A unified approach to ranking in probabilistic databases

Proceedings of the VLDB Endowment
Modeling and querying possible repairs in duplicate detection

Proceedings of the VLDB Endowment
Entity-aware query processing for heterogeneous data with uncertainty and correlations

Proceedings of the 2009 EDBT/ICDT Workshops
Enabling entity-based aggregators for web 2.0 data

Proceedings of the 19th international conference on World wide web
Querying and repairing inconsistent databases under three-valued semantics

ICLP'07 Proceedings of the 23rd international conference on Logic programming
Computing a k-route over uncertain geographical data

SSTD'07 Proceedings of the 10th international conference on Advances in spatial and temporal databases
On the first-order expressibility of computing certain answers to conjunctive queries over uncertain databases

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Leveraging spatio-temporal redundancy for RFID data cleansing

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
ERACER: a database approach for statistical inference and data cleaning

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
GRN model of probabilistic databases: construction, transition and querying

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Consistent query answers in inconsistent probabilistic databases

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
DCUBE: CUBE on dirty databases

WAIM'10 Proceedings of the 11th international conference on Web-age information management
On-the-fly entity-aware query processing in the presence of linkage

Proceedings of the VLDB Endowment
Scalable probabilistic databases with factor graphs and MCMC

Proceedings of the VLDB Endowment
Explore or exploit?: effective strategies for disambiguating large databases

Proceedings of the VLDB Endowment
Tractability in probabilistic databases

Proceedings of the 14th International Conference on Database Theory
Annotation based query answer over inconsistent database

Journal of Computer Science and Technology
Queries and materialized views on probabilistic databases

Journal of Computer and System Sciences
On counting database repairs

Proceedings of the 4th International Workshop on Logic in Databases
A unified approach to ranking in probabilistic databases

The VLDB Journal — The International Journal on Very Large Data Bases
Querying uncertain data with aggregate constraints

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
LinkDB: a probabilistic linkage database system

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
The monte carlo database system: Stochastic analysis close to the data

ACM Transactions on Database Systems (TODS)
Incorporating domain knowledge and user expertise in probabilistic tuple merging

SUM'11 Proceedings of the 5th international conference on Scalable uncertainty management
Cost-efficient repair in inconsistent probabilistic databases

Proceedings of the 20th ACM international conference on Information and knowledge management
Scrubbing query results from probabilistic databases

Proceedings of the 15th Symposium on International Database Engineering & Applications
Consistent query answering: five easy pieces

ICDT'07 Proceedings of the 11th international conference on Database Theory
World-set decompositions: expressiveness and efficient algorithms

ICDT'07 Proceedings of the 11th international conference on Database Theory
Certain conjunctive query answering in first-order logic

ACM Transactions on Database Systems (TODS)
Quality-aware service-oriented data integration: requirements, state of the art and open challenges

ACM SIGMOD Record
Prioritized repairing and consistent query answering in relational databases

Annals of Mathematics and Artificial Intelligence
Probabilistic query answering over inconsistent databases

Annals of Mathematics and Artificial Intelligence
A model of uncertainty for near-duplicates in document reference networks

ECDL'07 Proceedings of the 11th European conference on Research and Advanced Technology for Digital Libraries
A dichotomy in the complexity of counting database repairs

Journal of Computer and System Sciences
Real-time probabilistic data association over streams

Proceedings of the 7th ACM international conference on Distributed event-based systems

Abstract

The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such tuples, the merging or elimination of duplicates can be a difficult task that relies on ad-hoc and often manual solutions. We propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. We rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database. Our rewritten queries are sensitive to the semantics of duplication and help a user understand which query answers are most likely to be present in the clean database. The semantics that we adopt is independent of the way the probabilities are produced, but is able to effectively exploit them during query answering. In the absence of external knowledge that associates each database tuple with a probability, we offer a technique, based on tuple summaries, that automates this task. We experimentally study the performance of our rewritten queries. Our studies show that the rewriting does not introduce a significant overhead in query execution time. This work is done in the context of the ConQuer project at the University of Toronto, which focuses on the efficient management of inconsistent and dirty databases.
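
To make the idea of "each answer with the probability that the answer is in the clean database" concrete, here is a minimal sketch in Python. It is not the paper's query rewriting: it simply enumerates every candidate clean database by brute force and sums probabilities. The abstract does not spell out the data model, so the sketch assumes that duplicates are grouped into clusters, that exactly one tuple per cluster survives in the clean database, and that probabilities within a cluster sum to 1. The relation, its column layout, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical dirty relation: (cluster_id, name, balance, prob).
# Assumption: tuples sharing a cluster_id are duplicates of one entity,
# and their probabilities sum to 1 within the cluster.
CUSTOMERS = [
    ("c1", "John", 20000, 0.7),
    ("c1", "John", 500, 0.3),
    ("c2", "Mary", 27000, 0.2),
    ("c2", "Marion", 5000, 0.8),
]


def candidate_databases(rows):
    """Yield (database, probability) pairs, keeping one tuple per cluster."""
    clusters = {}
    for row in rows:
        clusters.setdefault(row[0], []).append(row)
    for choice in product(*clusters.values()):
        prob = 1.0
        for row in choice:
            # Clusters are assumed independent, so probabilities multiply.
            prob *= row[3]
        yield list(choice), prob


def clean_answers(rows, predicate):
    """Map each cluster id to the probability that it satisfies the predicate."""
    answers = {}
    for database, prob in candidate_databases(rows):
        for row in database:
            if predicate(row):
                answers[row[0]] = answers.get(row[0], 0.0) + prob
    return answers


# Which customers have a balance above 1000, and how likely is each answer?
result = clean_answers(CUSTOMERS, lambda row: row[2] > 1000)
```

Enumeration is exponential in the number of clusters, so it only serves to illustrate the semantics. Under the same assumptions, a single-relation selection like this one could be answered directly by grouping on the cluster identifier and summing the probabilities of the qualifying tuples.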