Building a generic debugger for information extraction pipelines

Authors:
Anish Das Sarma;Alpa Jain;Philip Bohannon
Affiliations:
Yahoo, Mountain View, CA, USA;Yahoo, Mountain View, CA, USA;Yahoo, Mountain View, CA, USA
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Citing 13
Cited 2

Snowball: extracting relations from large plain-text collections

DL '00 Proceedings of the fifth ACM conference on Digital libraries
ULDBs: databases with uncertainty and lineage

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Espresso: leveraging generic patterns for automatically harvesting semantic relations

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Toward best-effort information extraction

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
On the provenance of non-answers to queries over extracted data

Proceedings of the VLDB Endowment
A quality-aware optimizer for information extraction

ACM Transactions on Database Systems (TODS)
Optimizing SQL Queries over Text Databases

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Exploring a Few Good Tuples from Text Databases

ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Provenance in Databases: Why, How, and Where

Foundations and Trends in Databases
Organizing and searching the world wide web of facts - step one: the one-million fact extraction challenge

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
MING: mining informative entity relationship subgraphs

Proceedings of the 18th ACM conference on Information and knowledge management
Understanding provenance black boxes

Distributed and Parallel Databases
I4E: interactive investigation of iterative information extraction

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Automatic pipeline construction for real-time annotation

CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
Information extraction as a filtering task

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Complex information extraction (IE) pipelines are becoming an integral component of most text processing frameworks. We introduce a first system to help IE users analyze extraction pipeline semantics and operator transformations interactively while debugging. This allows the effort to be proportional to the need, and to focus on the portions of the pipeline under the greatest suspicion. We present a generic debugger for running post-execution analysis of any IE pipeline consisting of arbitrary types of operators. For this, we propose an effective provenance model for IE pipelines which captures a variety of operator types, ranging from those for which full to no specifications are available. We have evaluated our proposed algorithms and provenance model on large-scale real-world extraction pipelines.