Text documents often embed data that is structured in nature, and we can expose this structured data using information extraction technology. By processing a text database with information extraction systems, we can materialize a variety of structured "relations," over which we can then issue regular SQL queries. A key challenge in processing SQL queries in this text-based scenario is efficiency: information extraction is time-consuming, so query processing strategies should minimize the number of documents that they process. Another key challenge is result quality: in the traditional relational world, all correct execution strategies for a SQL query produce the same (correct) result; in contrast, a SQL query execution over a text database might produce answers that are not fully accurate or complete, for a number of reasons. To address these challenges, we study a family of select-project-join SQL queries over text databases, and characterize query processing strategies by their efficiency and, critically, by their result quality as well. We optimize the execution of SQL queries over text databases in a principled, cost-based manner, incorporating this tradeoff between efficiency and result quality in a user-specific fashion. Our large-scale experiments, over real data sets and multiple information extraction systems, show that our SQL query processing approach consistently picks appropriate execution strategies for the desired balance between efficiency and result quality.
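As a rough illustration of the efficiency versus quality tradeoff described above, the following minimal sketch picks the cheapest execution strategy whose estimated quality meets a user-specified threshold. It is not the optimizer from the paper. The plan names, the time and quality estimates, the use of F1 as the quality measure, and the `choose_plan` function are all hypothetical assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Plan:
    """A candidate execution strategy with estimated cost and quality.

    All fields are illustrative estimates, not values from the paper.
    """
    name: str
    est_seconds: float    # estimated extraction + query time
    est_precision: float  # estimated fraction of returned tuples that are correct
    est_recall: float     # estimated fraction of correct tuples that are returned


def f1(plan: Plan) -> float:
    """Harmonic mean of precision and recall (an assumed quality measure)."""
    denom = plan.est_precision + plan.est_recall
    if denom == 0:
        return 0.0
    return 2 * plan.est_precision * plan.est_recall / denom


def choose_plan(plans: Sequence[Plan], min_quality: float) -> Optional[Plan]:
    """Return the fastest plan whose estimated F1 is at least min_quality.

    Returns None when no candidate meets the quality requirement.
    """
    feasible = [p for p in plans if f1(p) >= min_quality]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.est_seconds)


# Hypothetical candidates: processing more documents costs more time
# but tends to recover more of the correct tuples.
PLANS = [
    Plan("scan-all-documents", 3600.0, 0.70, 0.95),
    Plan("keyword-filtered-scan", 900.0, 0.85, 0.60),
    Plan("query-driven-retrieval", 300.0, 0.90, 0.30),
]

if __name__ == "__main__":
    for threshold in (0.4, 0.7, 0.8, 0.95):
        best = choose_plan(PLANS, threshold)
        print(threshold, best.name if best else "no feasible plan")
```

Under these made-up estimates, a user who tolerates lower quality gets the fast, low-recall strategy, while a stricter threshold forces a strategy that processes more documents. A real optimizer would have to derive the cost and quality estimates from statistics about the text database and the extraction systems rather than take them as constants.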