Exploring a Few Good Tuples from Text Databases

Authors: Alpa Jain; Divesh Srivastava
Venue: ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Year: 2009

Abstract

Information extraction from text databases is a useful paradigm for populating relational tables and unlocking the considerable value hidden in plain-text documents. However, information extraction can be expensive because of the complex text-processing steps needed to uncover the hidden data. A large number of text databases are available, and not every text database is relevant to every relation. Hence, it is important to be able to quickly explore the utility of running an extractor for a specific relation over a given text database before carrying out the expensive extraction task. In this paper, we present a novel exploration methodology of finding a few good tuples for a relation that can be extracted from a database, which allows the relevance of the database to the relation to be judged. Specifically, we propose the notion of a good($k$, $\ell$) query as one that can return any $k$ tuples for a relation among the top-$\ell$ fraction of tuples ranked by their aggregated confidence scores, as provided by the extractor; if these tuples have high scores, the database can be judged relevant to the relation. We formalize the access model for information extraction and investigate efficient query processing algorithms for good($k$, $\ell$) queries that do not rely on any prior knowledge about the extraction task or the database. We demonstrate the viability of our algorithms using a detailed experimental study with real text databases.
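
To make the definition of a good($k$, $\ell$) query concrete, the following is a minimal Python sketch of its semantics only. It is not the paper's query processing algorithm: it assumes every extraction has already been performed, which is exactly the cost the paper's algorithms aim to avoid. The function name `good_k_l`, the `(tuple, confidence)` input format, the use of `sum` as the aggregation function, and the rounding of the top-$\ell$ cutoff with a ceiling are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def good_k_l(occurrences, k, ell, aggregate=sum):
    """Naive reference semantics for a good(k, ell) query (illustrative only).

    occurrences: iterable of (tuple_value, confidence) pairs, one pair per
                 extracted occurrence (hypothetical input format).
    k:           number of tuples to return.
    ell:         fraction in (0, 1] defining the "top" portion of the ranking.
    aggregate:   how per-occurrence confidences are combined into one score
                 per tuple; sum is an assumption, not the paper's choice.
    """
    if not 0 < ell <= 1:
        raise ValueError("ell must be a fraction in (0, 1]")

    # Group the confidence scores of all occurrences of the same tuple.
    scores_by_tuple = defaultdict(list)
    for tuple_value, confidence in occurrences:
        scores_by_tuple[tuple_value].append(confidence)

    # Rank distinct tuples by aggregated confidence, highest first.
    ranked = sorted(
        ((aggregate(scores), tuple_value)
         for tuple_value, scores in scores_by_tuple.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )

    # Keep only the top-ell fraction of the ranking (ceiling is an assumption).
    cutoff = math.ceil(ell * len(ranked))
    top = ranked[:cutoff]

    # Any k tuples from the top fraction are a valid answer; take the first k.
    if len(top) < k:
        return None
    return [tuple_value for _, tuple_value in top[:k]]
```

Because any $k$ tuples from the top-$\ell$ fraction are acceptable, returning the first $k$ is only one of several valid answers; this freedom is what would let an efficient algorithm stop before processing the whole database.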