Distributed search over the hidden web: hierarchical database sampling and selection

Authors: Panagiotis G. Ipeirotis; Luis Gravano
Affiliations: Columbia University (both authors)
Venue: VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases
Year: 2002

Abstract

Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from "uncooperative" databases by using "focused query probes," which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases. Our experiments indicate that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies. Also, our hierarchical database selection algorithm exhibits significantly higher precision than its flat counterparts.
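
To make the idea of hierarchical selection over content summaries concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes each content summary holds an estimated document count and per-word document frequencies, scores a database by the estimated number of documents matching all query words under a word-independence assumption, and descends a category tree using summaries aggregated over each category. The names `Summary`, `Category`, `score`, and `select`, and the example databases and numbers, are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Content summary: estimated size plus per-word document frequencies."""
    num_docs: int
    df: Counter = field(default_factory=Counter)

    def merged(self, other):
        # Aggregate two summaries by adding sizes and document frequencies.
        return Summary(self.num_docs + other.num_docs, self.df + other.df)


def score(summary, query):
    """Estimated number of documents containing every query word,
    assuming words occur independently (an illustrative choice)."""
    if summary.num_docs == 0:
        return 0.0
    estimate = float(summary.num_docs)
    for word in query:
        estimate *= summary.df[word] / summary.num_docs
    return estimate


@dataclass
class Category:
    name: str
    databases: dict = field(default_factory=dict)  # database name -> Summary
    children: list = field(default_factory=list)   # sub-categories

    def summary(self):
        # Category summary = aggregate of everything classified under it.
        total = Summary(0)
        for db_summary in self.databases.values():
            total = total.merged(db_summary)
        for child in self.children:
            total = total.merged(child.summary())
        return total


def select(category, query, k):
    """Return up to k database names, visiting promising sub-categories first."""
    chosen = []
    ranked = sorted(category.children,
                    key=lambda c: score(c.summary(), query), reverse=True)
    for child in ranked:
        if len(chosen) >= k:
            break
        if score(child.summary(), query) > 0:
            chosen += select(child, query, k - len(chosen))
    local = sorted(category.databases.items(),
                   key=lambda item: score(item[1], query), reverse=True)
    for name, _ in local:
        if len(chosen) >= k:
            break
        chosen.append(name)
    return chosen


# Hypothetical data: the sample for health_b missed the word "hypertension".
health = Category("Health", databases={
    "health_a": Summary(1000, Counter({"hypertension": 200, "blood": 300})),
    "health_b": Summary(100, Counter({"blood": 20})),
})
sports = Category("Sports", databases={
    "sports_a": Summary(500, Counter({"soccer": 100, "blood": 5})),
})
root = Category("Root", children=[health, sports])

print(select(root, ["hypertension"], 2))  # ['health_a', 'health_b']
```

In this toy example a flat ranking would give `health_b` and `sports_a` the same score of zero for the query, whereas the category-level summary for "Health" steers the second pick toward `health_b`. This mirrors, in a very simplified way, how a classification hierarchy can compensate for an incomplete summary.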