Query processing over incomplete autonomous databases: query rewriting using learned data dependencies

Authors:
Garrett Wolf;Aravind Kalavagattu;Hemal Khatri;Raju Balakrishnan;Bhaumik Chokshi;Jianchun Fan;Yi Chen;Subbarao Kambhampati
Affiliations:
Arizona State University, Tempe, USA;Arizona State University, Tempe, USA;Arizona State University, Tempe, USA;Arizona State University, Tempe, USA;Arizona State University, Tempe, USA;Arizona State University, Tempe, USA;Arizona State University, Tempe, USA;Arizona State University, Tempe, USA
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2009

Abstract

Incompleteness due to missing attribute values (aka "null values") is very common in autonomous web databases, which users typically access through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if they turn out to be relevant to a user query. Ideally, we would like the mediator to retrieve such possible answers and gauge their relevance by assessing their likelihood of being pertinent answers to the query. The autonomous nature of web databases poses several challenges in realizing this objective. Such challenges include the restricted access privileges imposed on the data, the limited support for query patterns, and the bounded pool of database and network resources in the web environment. We introduce a novel query rewriting and optimization framework, QPIAD, that tackles these challenges. Our technique involves reformulating the user query based on mined correlations among the database attributes. The reformulated queries are aimed at retrieving the relevant possible answers in addition to the certain answers. QPIAD is able to gauge the relevance of such queries, allowing tradeoffs in reducing the costs of database query processing and answer transmission. To support this framework, we develop methods for mining attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in the form of Naïve Bayes Classifiers), and selectivity estimates. We present empirical studies to demonstrate that our approach is able to effectively retrieve relevant possible answers with high precision, high recall, and manageable cost.
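
The rewriting idea described in the abstract can be illustrated with a small sketch. It is not the QPIAD implementation: the toy relation, the attribute names, and the functions `conditional_distribution` and `answer_with_possible` are all hypothetical. It assumes a single mined dependency (model approximately determines body style), replaces the Naïve Bayes classifiers with plain frequency counts, and orders rewritten queries by estimated precision only, ignoring selectivity. It also assumes that rewritten queries are generated from the determining-attribute values seen in the certain answers.

```python
from collections import Counter, defaultdict

# Hypothetical toy relation; None marks a missing (null) attribute value.
CARS = [
    {"make": "BMW", "model": "Z4", "body": "Convertible"},
    {"make": "BMW", "model": "Z4", "body": "Convertible"},
    {"make": "BMW", "model": "Z4", "body": None},
    {"make": "Honda", "model": "Civic", "body": "Sedan"},
    {"make": "Honda", "model": "Civic", "body": "Coupe"},
    {"make": "Honda", "model": "Civic", "body": None},
    {"make": "Mazda", "model": "Miata", "body": "Convertible"},
    {"make": "Mazda", "model": "Miata", "body": None},
    {"make": "Audi", "model": "A4", "body": "Sedan"},
    {"make": "Audi", "model": "A4", "body": "Convertible"},
    {"make": "Audi", "model": "A4", "body": "Sedan"},
    {"make": "Audi", "model": "A4", "body": None},
]


def conditional_distribution(rows, determining, target):
    """Estimate P(target | determining) from rows where both values are present.

    Assumption: simple relative frequencies stand in for a learned classifier.
    """
    counts = defaultdict(Counter)
    for row in rows:
        if row[determining] is not None and row[target] is not None:
            counts[row[determining]][row[target]] += 1
    return {
        key: {value: n / sum(c.values()) for value, n in c.items()}
        for key, c in counts.items()
    }


def answer_with_possible(rows, determining, target, wanted):
    """Return (certain answers, ranked possible answers) for target = wanted."""
    # Certain answers: tuples that satisfy the original selection outright.
    certain = [r for r in rows if r[target] == wanted]
    dist = conditional_distribution(rows, determining, target)

    # One rewritten query per distinct determining value among certain answers,
    # ordered by estimated precision (ties broken alphabetically).
    values = {r[determining] for r in certain if r[determining] is not None}
    ordered = sorted(values, key=lambda v: (-dist[v][wanted], v))

    # Each rewritten query selects tuples with that determining value and a
    # null target; every retrieved tuple inherits the query's estimate.
    possible = []
    for value in ordered:
        for r in rows:
            if r[determining] == value and r[target] is None:
                possible.append((r, dist[value][wanted]))
    return certain, possible


certain, possible = answer_with_possible(CARS, "model", "body", "Convertible")
for row, score in possible:
    print(f"{row['make']} {row['model']}: estimated relevance {score:.2f}")
```

In this sketch the Civic tuple with a null body style is never retrieved, because no certain answer has that model and so no rewritten query is generated for it. A fuller treatment would also weigh each rewritten query's expected result size, which is where the selectivity estimates mentioned in the abstract would come in.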