Crawling for Domain-Specific Hidden Web Resources

Authors:
André Bergholz;Boris Chidlovskii
Affiliations:
-;-
Venue:
WISE '03 Proceedings of the Fourth International Conference on Web Information Systems Engineering
Year:
2003

Citing 0
Cited 12

Understanding Web query interfaces: best-effort parsing with hidden syntax

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Combining classifiers to identify online databases

Proceedings of the 16th international conference on World Wide Web
DeepBot: a focused crawler for accessing hidden web content

Proceedings of the 3rd international workshop on Data enginering issues in E-commerce and services: In conjunction with ACM Conference on Electronic Commerce (EC '07)
Automatically maintaining navigation sequences for querying semi-structured web sources

Data & Knowledge Engineering
Automatic Hidden Web Database Classification

PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases
Automated ontology instantiation from tabular web sources-The AllRight system

Web Semantics: Science, Services and Agents on the World Wide Web
A hierarchical approach to model web query interfaces for web source integration

Proceedings of the VLDB Endowment
Querying capability modeling and construction of deep web sources

WISE'07 Proceedings of the 8th international conference on Web information systems engineering
Crawling the content hidden behind web forms

ICCSA'07 Proceedings of the 2007 international conference on Computational science and Its applications - Volume Part II
On building a search interface discovery system

RED'09 Proceedings of the 2nd international conference on Resource discovery
Topic-Sensitive hidden-web crawling

WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
Development of an intelligent distributed news retrieval system

International Journal of Knowledge-based and Intelligent Engineering Systems

Abstract

The Hidden Web, the part of the Web that remains unavailable to standard crawlers, has become an important research topic in recent years. Its size is estimated to be 400 to 500 times larger than that of the Publicly Indexable Web (PIW). Furthermore, the information on the Hidden Web is assumed to be more structured, because it is usually stored in databases. In this paper we describe a crawler which, starting from the PIW, finds entry points into the Hidden Web. The crawler is domain-specific and is initialized with pre-classified documents and relevant keywords. We describe our approach to the automatic identification of Hidden Web resources among encountered HTML forms. We conduct a series of experiments using the top-level categories in the Google Directory and report our analysis of the discovered Hidden Web resources.