A quality-aware optimizer for information extraction

Authors:
Alpa Jain;Panagiotis G. Ipeirotis
Affiliations:
Columbia University, New York;New York University, New York
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2009

Citing 22
Cited 4

Random sampling for histogram construction: how much is enough?

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
An adaptive query execution system for data integration

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Learning dictionaries for information extraction by multi-level bootstrapping

AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence
Eddies: continuously adaptive query processing

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Snowball: extracting relations from large plain-text collections

DL '00 Proceedings of the fifth ACM conference on Digital libraries
Robust Classification for Imprecise Environments

Machine Learning
Information Extraction: Techniques and Challenges

SCIE '97 International Summer School on Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology
Extracting Patterns and Relations from the World Wide Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Information extraction for enhanced access to disease outbreak reports

Journal of Biomedical Informatics - Special issue: Sublanguage
Web-scale information extraction in knowitall: (preliminary results)

Proceedings of the 13th international conference on World Wide Web
A search engine for natural language applications

WWW '05 Proceedings of the 14th international conference on World Wide Web
ROC confidence bands: an empirical evaluation

ICML '05 Proceedings of the 22nd international conference on Machine learning
Extracting relations from large text collections

Extracting relations from large text collections
Integrating Unstructured Data into Relational Databases

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
To search or to crawl?: towards a query optimizer for text-centric tasks

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Creating probabilistic databases from information extraction models

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Names and similarities on the web: fact extraction in the fast lane

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Towards a query optimizer for text-centric tasks

ACM Transactions on Database Systems (TODS)
Get another label? improving data quality and data mining using multiple, noisy labelers

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Building query optimizers for information extraction: the SQoUT project

ACM SIGMOD Record
Optimizing SQL Queries over Text Databases

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
A probabilistic model of redundancy in information extraction

IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence

Building query optimizers for information extraction: the SQoUT project

ACM SIGMOD Record
From information to knowledge: harvesting entities and relationships from web sources

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Building a generic debugger for information extraction pipelines

Proceedings of the 20th ACM international conference on Information and knowledge management
When speed has a price: fast information extraction using approximate algorithms

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

A large amount of structured information is buried in unstructured text. Information extraction systems can extract structured relations from the documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect and their output has imperfect precision and recall (i.e., contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that can be used as “knobs” to tune the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad hoc procedure, based mainly on heuristics. In this article, we show how to use Receiver Operating Characteristic (ROC) curves to estimate the extraction quality in a statistically robust way and show how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the runtime and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach predicts accurately the output quality and selects the fastest execution plan that satisfies the output quality restrictions.