Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages on specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: we can either scan, or "crawl," the text database or, alternatively, exploit search engine indexes and retrieve the documents of interest via carefully crafted, task-specific queries. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc, based on heuristics or plain intuition. In this paper, we present fundamental building blocks for making the choice of execution plan for a text-centric task in an informed, cost-based way. Toward this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plan for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation of three important text-centric tasks over multiple real-life data sets.
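The crawl-versus-query trade-off sketched above can be illustrated with a toy cost comparison. This is a minimal sketch, not the paper's actual cost model: the function names, all numeric parameters, and the sampling-with-replacement recall approximation for the query-based plan are illustrative assumptions.

```python
# Toy illustration of a cost-based choice between a crawl-based and a
# query-based execution plan for a text-centric task. All parameters and
# the recall model are hypothetical assumptions, not the paper's model.

def crawl_plan_cost(num_docs, t_process):
    """Scan every document: full recall, cost linear in collection size."""
    return num_docs * t_process, 1.0

def query_plan_cost(num_docs, docs_per_query, num_queries, t_query, t_process):
    """Issue queries against a search index. Expected recall is modeled
    with a simple approximation: each query retrieves docs_per_query
    documents uniformly at random, with replacement across queries."""
    retrieved = num_docs * (1.0 - (1.0 - docs_per_query / num_docs) ** num_queries)
    time = num_queries * t_query + retrieved * t_process
    return time, retrieved / num_docs

def choose_plan(target_recall, num_docs=1_000_000, docs_per_query=100,
                t_query=0.5, t_process=0.1, max_queries=200_000):
    """Pick the cheaper plan that reaches the target recall."""
    crawl_time, _ = crawl_plan_cost(num_docs, t_process)
    # Find the smallest number of queries that reaches the target recall,
    # then compare the resulting cost against a full crawl.
    for q in range(1, max_queries):
        q_time, recall = query_plan_cost(num_docs, docs_per_query, q,
                                         t_query, t_process)
        if recall >= target_recall:
            return ("query", q_time) if q_time < crawl_time else ("crawl", crawl_time)
    return "crawl", crawl_time

print(choose_plan(0.5))   # modest recall target: query-based plan wins here
print(choose_plan(0.99))  # near-complete output: crawling wins here
```

Even in this crude model, the crossover behavior the abstract describes emerges: query-based plans are cheap at modest recall targets, but the marginal cost of new documents grows as queries increasingly retrieve already-seen documents, so crawling eventually dominates when near-complete output is required.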