Topic-Sensitive hidden-web crawling

Authors:
Panagiotis Liakos; Alexandros Ntoulas
Affiliations:
National and Kapodistrian University of Athens, Greece; National and Kapodistrian University of Athens, Greece, and Zynga, San Francisco
Venue:
WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
Year:
2012

Abstract

A constantly growing amount of high-quality information is stored in pages coming from the Hidden Web. Such pages are accessible only through the query interface that a Hidden Web site provides, and they may span a variety of topics. In order to provide centralized access to the Hidden Web, previous work has focused on query generation techniques that aim to download all the content of a given Hidden Web site at minimum cost. In certain settings, however, we are interested in downloading only a specific part of such a site. For example, in a news database, a user may be interested in retrieving only sports articles and not political ones. In this case, we need to make the best use of our resources by downloading only the portion of the Hidden Web site that we are interested in. In this paper, we study how to build a topically focused Hidden Web crawler that can autonomously extract topic-specific pages from the Hidden Web by searching only the subset that is related to the corresponding category. To this end, we present query generation techniques that take into account the topic that we are interested in. We propose a number of different crawling policies and experimentally evaluate them with data from two popular sites.