Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages on specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: we can either scan, or "crawl," the text database or, alternatively, exploit search engine indexes and retrieve the documents of interest via carefully crafted, task-specific queries. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc, based on heuristics or plain intuition. In this paper, we present fundamental building blocks for making the choice of execution plan for a text-centric task in an informed, cost-based way. Toward this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plan for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation of three important text-centric tasks over multiple real-life data sets.
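The crawl-versus-query trade-off sketched above can be illustrated with a toy cost comparison. This is a minimal sketch, not the paper's actual cost model: the function names, all numeric parameters, and the sampling-with-replacement recall approximation for the query-based plan are illustrative assumptions.

```python
# Toy illustration of a cost-based choice between a crawl-based and a
# query-based execution plan for a text-centric task. All parameters and
# the recall model are hypothetical assumptions, not the paper's model.

def crawl_plan_cost(num_docs, t_process):
    """Scan every document: full recall, cost linear in collection size."""
    return num_docs * t_process, 1.0

def query_plan_cost(num_docs, docs_per_query, num_queries, t_query, t_process):
    """Issue queries against a search index. Expected recall is modeled
    with a simple approximation: each query retrieves docs_per_query
    documents uniformly at random, with replacement across queries."""
    retrieved = num_docs * (1.0 - (1.0 - docs_per_query / num_docs) ** num_queries)
    time = num_queries * t_query + retrieved * t_process
    return time, retrieved / num_docs

def choose_plan(target_recall, num_docs=1_000_000, docs_per_query=100,
                t_query=0.5, t_process=0.1, max_queries=200_000):
    """Pick the cheaper plan that reaches the target recall."""
    crawl_time, _ = crawl_plan_cost(num_docs, t_process)
    # Find the smallest number of queries that reaches the target recall,
    # then compare the resulting cost against a full crawl.
    for q in range(1, max_queries):
        q_time, recall = query_plan_cost(num_docs, docs_per_query, q,
                                         t_query, t_process)
        if recall >= target_recall:
            return ("query", q_time) if q_time < crawl_time else ("crawl", crawl_time)
    return "crawl", crawl_time

print(choose_plan(0.5))   # modest recall target: query-based plan wins here
print(choose_plan(0.99))  # near-complete output: crawling wins here
```

Even in this crude model, the crossover behavior the abstract describes emerges: query-based plans are cheap at modest recall targets, but the marginal cost of new documents grows as queries increasingly retrieve already-seen documents, so crawling eventually dominates when near-complete output is required.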