Optimizing SQL Queries over Text Databases

Authors:
Alpa Jain;AnHai Doan;Luis Gravano
Affiliations:
Computer Science Department, Columbia University. alpa@cs.columbia.edu;Department of Computer Sciences, University of Wisconsin-Madison. anhai@cs.wisc.edu;Computer Science Department, Columbia University. gravano@cs.columbia.edu
Venue:
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Year:
2008

Citing 0
Cited 12

On the provenance of non-answers to queries over extracted data

Proceedings of the VLDB Endowment
A quality-aware optimizer for information extraction

ACM Transactions on Database Systems (TODS)
Building query optimizers for information extraction: the SQoUT project

ACM SIGMOD Record
Expressive and flexible access to web-extracted data: a keyword-based structured query language

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
I4E: interactive investigation of iterative information extraction

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Self-supervised web search for any-k complete tuples

Proceedings of the 2nd International Workshop on Business intelligencE and the WEB
Building a generic debugger for information extraction pipelines

Proceedings of the 20th ACM international conference on Information and knowledge management
NoDB: efficient query execution on raw data files

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Just-in-time information extraction using extraction views

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Beyond search: Retrieving complete tuples from a text-database

Information Systems Frontiers
INDREX: in-database distributional relation extraction

Proceedings of the sixteenth international workshop on Data warehousing and OLAP
When speed has a price: fast information extraction using approximate algorithms

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

Text documents often embed data that is structured in nature, and we can expose this structured data using information extraction technology. By processing a text database with information extraction systems, we can materialize a variety of structured "relations," over which we can then issue regular SQL queries. A key challenge to process SQL queries in this text-based scenario is efficiency: information extraction is time-consuming, so query processing strategies should minimize the number of documents that they process. Another key challenge is result quality: in the traditional relational world, all correct execution strategies for a SQL query produce the same (correct) result; in contrast, a SQL query execution over a text database might produce answers that are not fully accurate or complete, for a number of reasons. To address these challenges, we study a family of select-project-join SQL queries over text databases, and characterize query processing strategies on their efficiency and - critically - on their result quality as well. We optimize the execution of SQL queries over text databases in a principled, cost-based manner, incorporating this tradeoff between efficiency and result quality in a user-specific fashion. Our large-scale experiments- over real data sets and multiple information extraction systems - show that our SQL query processing approach consistently picks appropriate execution strategies for the desired balance between efficiency and result quality.