Join Optimization of Information Extraction Output: Quality Matters!

Authors:
Alpa Jain; Panagiotis G. Ipeirotis; AnHai Doan; Luis Gravano
Venue:
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Year:
2009

Citing: 0
Cited by: 10

Building query optimizers for information extraction: the SQoUT project
ACM SIGMOD Record

From information to knowledge: harvesting entities and relationships from web sources
Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

SystemT: an algebraic approach to declarative information extraction
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Automatic rule refinement for information extraction
Proceedings of the VLDB Endowment

Scalable knowledge harvesting with high precision and high recall
Proceedings of the fourth ACM international conference on Web search and data mining

Self-supervised web search for any-k complete tuples
Proceedings of the 2nd International Workshop on Business intelligencE and the WEB

Chapter 3: search for knowledge
Search Computing

Just-in-time information extraction using extraction views
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Colledge: a vision of collaborative knowledge networks
Proceedings of the 2nd International Workshop on Semantic Search over the Web

Beyond search: Retrieving complete tuples from a text-database
Information Systems Frontiers

Abstract

Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers several alternatives for these factors, and predicts the output quality (and, of course, the execution time) of the alternate execution plans. We establish the accuracy of our analytical models, as well as study the effectiveness of a quality-aware join optimizer, with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
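
The abstract names three factors that define an execution plan (IE system configuration, document retrieval strategy, join algorithm) and says that quality and time are predicted per plan. The sketch below is one possible way to picture that plan space, not the paper's method: it enumerates every combination and picks the fastest plan whose estimated quality meets a threshold. All names (`Plan`, `Estimate`, `enumerate_plans`, `choose_plan`, `toy_estimate`), the option labels such as `ie_A`, and every number are hypothetical placeholders; the paper's analytical models are not reproduced here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Plan:
    """One candidate execution plan (hypothetical representation)."""
    ie_config: str       # (a) configuration of the IE systems
    retrieval: str       # (b) document retrieval strategy
    join_algorithm: str  # (c) join algorithm


@dataclass(frozen=True)
class Estimate:
    quality: float  # assumed score in [0, 1]; higher is better
    time: float     # assumed cost in arbitrary units; lower is better


def enumerate_plans(ie_configs: Sequence[str],
                    retrievals: Sequence[str],
                    join_algorithms: Sequence[str]) -> list:
    """Build the cross product of the three factors."""
    return [Plan(c, r, j)
            for c, r, j in product(ie_configs, retrievals, join_algorithms)]


def choose_plan(plans: Sequence[Plan],
                estimate: Callable[[Plan], Estimate],
                min_quality: float) -> Optional[Plan]:
    """Return the fastest plan whose estimated quality is at least
    min_quality, or None if no plan qualifies."""
    scored = [(estimate(p), p) for p in plans]
    feasible = [(e, p) for e, p in scored if e.quality >= min_quality]
    if not feasible:
        return None
    return min(feasible, key=lambda pair: pair[0].time)[1]


# Placeholder numbers for illustration only; they are NOT from the paper.
_QUALITY = {"ie_A": 0.9, "ie_B": 0.7,
            "ret_A": 1.0, "ret_B": 0.8,
            "join_A": 1.0, "join_B": 0.9}
_TIME = {"ie_A": 5.0, "ie_B": 2.0,
         "ret_A": 10.0, "ret_B": 3.0,
         "join_A": 4.0, "join_B": 1.0}


def toy_estimate(plan: Plan) -> Estimate:
    """Toy stand-in for an analytical model: multiply per-factor quality
    scores and add per-factor costs (an assumption made for this sketch)."""
    parts = (plan.ie_config, plan.retrieval, plan.join_algorithm)
    quality = 1.0
    for part in parts:
        quality *= _QUALITY[part]
    return Estimate(quality=quality, time=sum(_TIME[p] for p in parts))


if __name__ == "__main__":
    plans = enumerate_plans(["ie_A", "ie_B"],
                            ["ret_A", "ret_B"],
                            ["join_A", "join_B"])
    print(choose_plan(plans, toy_estimate, min_quality=0.7))
```

Treating quality as a constraint and time as the objective is only one way to combine the two; the abstract does not say how the optimizer trades them off, so the threshold form here is an assumption.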