When speed has a price: fast information extraction using approximate algorithms

Authors:
Gonçalo Simões; Helena Galhardas; Luis Gravano
Affiliations:
INESC-ID and Instituto Superior Técnico, Portugal; INESC-ID and Instituto Superior Técnico, Portugal; Columbia University, New York
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

A wealth of information produced by individuals and organizations is expressed in natural language text. This is a problem since text lacks the explicit structure that is necessary to support rich querying and analysis. Information extraction systems are sophisticated software tools to discover structured information in natural language text. Unfortunately, information extraction is a challenging and time-consuming task. In this paper, we address the limitations of state-of-the-art systems for the optimization of information extraction programs, with the objective of producing efficient extraction executions. Our solution relies on exploiting a wide range of optimization opportunities. For efficiency, we consider a wide spectrum of execution plans, including approximate plans whose results differ in their precision and recall. Our optimizer accounts for these characteristics of the competing execution plans, and uses accurate predictors of their extraction time, recall, and precision. We demonstrate the efficiency and effectiveness of our optimizer through a large-scale experimental evaluation over real-world datasets and multiple extraction tasks and approaches.
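
The abstract's idea of weighing competing execution plans by their predicted extraction time, recall, and precision can be illustrated with a minimal sketch. This is not the paper's optimizer: the `PlanEstimate` type, the `choose_plan` function, the selection rule (fastest plan that meets user-given quality thresholds), and all numbers below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PlanEstimate:
    """Hypothetical predicted characteristics of one execution plan."""
    name: str
    time_s: float      # predicted extraction time, in seconds
    recall: float      # predicted recall, in [0, 1]
    precision: float   # predicted precision, in [0, 1]


def choose_plan(plans: Iterable[PlanEstimate],
                min_recall: float,
                min_precision: float) -> Optional[PlanEstimate]:
    """Return the fastest plan meeting both thresholds, or None if none does.

    Assumed selection rule for illustration; the paper may use another.
    """
    feasible = [p for p in plans
                if p.recall >= min_recall and p.precision >= min_precision]
    return min(feasible, key=lambda p: p.time_s, default=None)


# Made-up estimates: one exact plan and two faster approximate plans.
plans = [
    PlanEstimate("exact", time_s=120.0, recall=0.90, precision=0.85),
    PlanEstimate("approx_a", time_s=45.0, recall=0.80, precision=0.84),
    PlanEstimate("approx_b", time_s=15.0, recall=0.55, precision=0.70),
]

best = choose_plan(plans, min_recall=0.75, min_precision=0.80)
print(best.name if best else "no feasible plan")
```

Under these made-up estimates, the fastest approximate plan is rejected for falling short of the quality thresholds, and the cheaper of the two remaining plans is chosen. Tightening the thresholds beyond what any plan is predicted to deliver yields no feasible plan.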