DeepBot: a focused crawler for accessing hidden web content

Authors:
Manuel Álvarez;Juan Raposo;Alberto Pan;Fidel Cacheda;Fernando Bellas;Víctor Carneiro
Affiliations:
University of A Coruña, Spain;University of A Coruña, Spain;University of A Coruña, Spain;University of A Coruña, Spain;University of A Coruña, Spain;University of A Coruña, Spain
Venue:
Proceedings of the 3rd international workshop on Data engineering issues in E-commerce and services: In conjunction with ACM Conference on Electronic Commerce (EC '07)
Year:
2007

Citing 10
Cited 6

Abstract

Today's crawler engines cannot reach most of the information contained in the Web. A great amount of valuable information is "hidden" behind the query forms of online databases, and/or is dynamically generated by technologies such as JavaScript. This portion of the Web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web focused crawler able to access such content. DeepBot receives a set of domain definitions as input, each one describing a specific data-collection task, and automatically identifies and learns to execute queries on the forms relevant to them. In this paper we describe the techniques employed for building DeepBot and report the experimental results obtained when testing it with several real-world data-collection tasks.
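The abstract does not say what a domain definition contains or how DeepBot decides that a form is relevant to it. As a purely illustrative sketch, assume a domain definition lists the attributes of a data-collection task together with label texts that may denote each attribute on a form, and that a form counts as relevant when enough attributes can be matched to its fields. The names `DomainDefinition`, `min_coverage`, `match_form` and `is_relevant`, the exact-label matching, and the sample forms are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class DomainDefinition:
    """Hypothetical description of one data-collection task."""
    name: str
    # attribute name -> lowercase label texts that may denote it (assumed format)
    attributes: dict
    # minimum fraction of attributes a form must expose (assumed threshold)
    min_coverage: float = 0.5


def normalize(label: str) -> str:
    """Lowercase a form label and drop surrounding whitespace and colons."""
    return label.strip().strip(":").strip().lower()


def match_form(domain: DomainDefinition, field_labels: dict) -> dict:
    """Map each domain attribute to the first form field whose label matches it.

    field_labels maps a form field name to the visible label next to it.
    """
    matches = {}
    for field_name, label in field_labels.items():
        text = normalize(label)
        for attribute, aliases in domain.attributes.items():
            if attribute not in matches and text in aliases:
                matches[attribute] = field_name
    return matches


def is_relevant(domain: DomainDefinition, field_labels: dict) -> bool:
    """Treat a form as relevant when enough attributes were matched."""
    if not domain.attributes:
        return False
    coverage = len(match_form(domain, field_labels)) / len(domain.attributes)
    return coverage >= domain.min_coverage


# Hypothetical domain definition and two made-up forms.
books = DomainDefinition(
    name="books",
    attributes={
        "title": {"title", "book title"},
        "author": {"author", "written by"},
        "isbn": {"isbn"},
    },
)
search_form = {"q_title": "Title:", "q_author": "Author ", "q_kw": "Keywords"}
login_form = {"user": "Email", "pass": "Password"}

print(match_form(books, search_form))
print(is_relevant(books, search_form), is_relevant(books, login_form))
```

In this sketch the search form matches two of the three attributes and passes the assumed 0.5 threshold, while the login form matches none. A real crawler would presumably need approximate text matching and would also have to work out which label belongs to which field, neither of which is modelled here.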