I4E: interactive investigation of iterative information extraction

Authors:
Anish Das Sarma;Alpa Jain;Divesh Srivastava
Affiliations:
Yahoo Research, Santa Clara, CA, USA;Yahoo Research, Santa Clara, CA, USA;AT&T Labs-Research, Florham Park, NJ, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

Information extraction systems are increasingly used to mine structured information from unstructured text documents. A commonly used unsupervised technique is to build iterative information extraction (IIE) systems that learn task-specific rules, called patterns, to generate the desired tuples. The output of such a system often contains unexpected results, which may stem from an incorrect pattern, an incorrect tuple, or both. In such scenarios, users and developers of the extraction system could greatly benefit from an investigation tool that helps them quickly reason about and repair the output. In this paper, we develop an approach for interactive post-extraction investigation for IIE systems. We formalize three important phases of this investigation: explaining the IIE result, diagnosing the influential and problematic components, and repairing the output of the information extraction system. We show how to characterize the execution of an IIE system and build a suite of algorithms to answer questions pertaining to each of these phases. We experimentally evaluate our approach on several domains, using a Web corpus of about 500 million documents. We show that our approach effectively enables post-extraction investigation while maximizing the gain from user and developer interaction.
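
To make the setting concrete, here is a toy Python sketch of an iterative extraction loop with simple lineage tracking. It is an illustration under stated assumptions and not the paper's algorithms: the function names (run_iie, explain, repair), the five-sentence corpus, the seed tuple, and the choice of "text between the two values" as a pattern are all hypothetical. It shows how one overly generic pattern can introduce a wrong tuple, how lineage can explain that tuple, and how removing the pattern repairs the output.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus and seed; not data from the paper.
CORPUS = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "Rome, located in Italy",
    "Berlin, located in Germany",
    "Sony, located in Tokyo",
]
SEEDS = {("Paris", "France")}


def run_iie(corpus, seeds, rounds=2):
    """Alternate between learning patterns from tuples and tuples from patterns.

    Returns (tuples, patterns, lineage); lineage maps each derived item
    to the set of items that directly produced it. Seeds have no parents.
    """
    tuples = set(seeds)
    patterns = set()
    lineage = defaultdict(set)
    for _ in range(rounds):
        # Learn a pattern: the text between the two values of a known tuple.
        for a, b in sorted(tuples):
            for sentence in corpus:
                m = re.search(re.escape(a) + r"(.+?)" + re.escape(b), sentence)
                if m:
                    p = ("pattern", m.group(1))
                    patterns.add(p)
                    lineage[p].add(("tuple", a, b))
        # Apply every pattern to the corpus to extract tuples.
        for p in sorted(patterns):
            regex = re.compile(r"(\w+)" + re.escape(p[1]) + r"(\w+)")
            for sentence in corpus:
                for a, b in regex.findall(sentence):
                    if (a, b) not in seeds:
                        lineage[("tuple", a, b)].add(p)
                    tuples.add((a, b))
    return tuples, patterns, lineage


def explain(item, lineage, seen=None):
    """Return every pattern and tuple that item transitively depends on."""
    seen = set() if seen is None else seen
    for parent in lineage.get(item, ()):
        if parent not in seen:
            seen.add(parent)
            explain(parent, lineage, seen)
    return seen


def repair(seeds, lineage, bad_item):
    """Return the items still derivable from the seeds once bad_item is removed."""
    kept = {("tuple", a, b) for a, b in seeds}
    changed = True
    while changed:
        changed = False
        for item, parents in lineage.items():
            # An item survives if at least one of its parents survives.
            if item != bad_item and item not in kept and parents & kept:
                kept.add(item)
                changed = True
    return kept


tuples, patterns, lineage = run_iie(CORPUS, SEEDS)
bad_pattern = ("pattern", ", located in ")
print(sorted(tuples))
print(explain(("tuple", "Sony", "Tokyo"), lineage))
print(sorted(repair(SEEDS, lineage, bad_pattern)))
```

In this sketch a tuple or pattern is kept as long as any one of its parents survives, which is a simplifying assumption; a real system would likely weigh support and confidence rather than treat derivations as all-or-nothing.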