Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. The quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers several alternatives for these factors and predicts the output quality, as well as the execution time, of the alternative execution plans. We establish the accuracy of our analytical models and study the effectiveness of a quality-aware join optimizer through a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
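The plan space described above (IE configuration, document retrieval strategy, join algorithm) can be illustrated with a minimal Python sketch. It is not the paper's optimizer or its analytical models. The names `Plan`, `Estimate`, `toy_estimate`, and `choose_plan`, the option labels, and every number in the lookup tables are hypothetical placeholders chosen for illustration. The sketch assumes that quality is expressed as predicted counts of good and bad join tuples and that the optimizer picks the fastest plan meeting user-given quality bounds, which is one plausible reading of "quality-aware".

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Plan:
    ie_config: str   # (a) configuration of the IE systems
    retrieval: str   # (b) document retrieval strategy
    join_algo: str   # (c) join algorithm


@dataclass(frozen=True)
class Estimate:
    good_tuples: float  # predicted number of correct join tuples
    bad_tuples: float   # predicted number of incorrect join tuples
    time: float         # predicted execution time, arbitrary units


# Hypothetical per-option parameters; a real system would derive such
# predictions from analytical models, not from fixed tables.
RECALL = {"strict": 0.4, "lenient": 0.8}
ERROR_RATE = {"strict": 0.05, "lenient": 0.30}
COVERAGE = {"scan": 1.0, "query": 0.6}
DOC_COST = {"scan": 100.0, "query": 30.0}
JOIN_COST = {"algo_x": 1.0, "algo_y": 1.5}
CANDIDATE_TUPLES = 1000.0


def toy_estimate(plan: Plan) -> Estimate:
    """Toy stand-in for a quality and time predictor."""
    reach = CANDIDATE_TUPLES * COVERAGE[plan.retrieval]
    return Estimate(
        good_tuples=reach * RECALL[plan.ie_config],
        bad_tuples=reach * ERROR_RATE[plan.ie_config],
        time=DOC_COST[plan.retrieval] * JOIN_COST[plan.join_algo],
    )


def enumerate_plans() -> List[Plan]:
    """Cross product of the three factors that define a plan."""
    return [Plan(c, r, j) for c, r, j in product(RECALL, COVERAGE, JOIN_COST)]


def choose_plan(
    plans: List[Plan],
    estimate: Callable[[Plan], Estimate],
    min_good: float,
    max_bad: float,
) -> Optional[Plan]:
    """Return the fastest plan that satisfies both quality bounds, or None."""
    feasible = []
    for plan in plans:
        est = estimate(plan)
        if est.good_tuples >= min_good and est.bad_tuples <= max_bad:
            feasible.append((est.time, plan))
    if not feasible:
        return None
    return min(feasible, key=lambda pair: pair[0])[1]


best = choose_plan(enumerate_plans(), toy_estimate, min_good=300, max_bad=200)
print(best)
```

With these made-up numbers, the plan with the lowest predicted time among all eight candidates is rejected because it yields too few good tuples, which is the point of the abstract: execution time alone is not a sufficient optimization criterion when the plans differ in output quality.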