Crawling the content hidden behind web forms

  • Authors:
  • Manuel Álvarez; Juan Raposo; Alberto Pan; Fidel Cacheda; Fernando Bellas; Víctor Carneiro

  • Affiliations:
  • Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain (all authors)

  • Venue:
  • ICCSA'07: Proceedings of the 2007 International Conference on Computational Science and Its Applications - Volume Part II
  • Year:
  • 2007


Abstract

Today's crawler engines cannot reach most of the information contained in the Web. A great amount of valuable information is "hidden" behind the query forms of online databases, or is dynamically generated by technologies such as JavaScript. This portion of the web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access such content. DeepBot receives as input a set of domain definitions, each one describing a specific data-collecting task, and automatically identifies and learns to execute queries on the forms relevant to them. In this paper we describe the techniques employed to build DeepBot and report the experimental results obtained when testing it on several real-world data collection tasks.
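To make the idea of "domain definitions" concrete, the following is a minimal sketch of how a crawler might decide that a web form is relevant to a data-collecting task. All names, the alias-based matching, and the scoring threshold are illustrative assumptions for this sketch, not the actual algorithm used by DeepBot.

```python
# Hedged sketch: matching a web form against a domain definition.
# The domain definition lists the attributes a task cares about, each
# with textual aliases that might label a form field; a form is deemed
# relevant when enough attributes can be bound to its fields.
from dataclasses import dataclass

@dataclass
class Attribute:
    name: str
    aliases: set          # lowercase label variants for this attribute

@dataclass
class DomainDefinition:
    name: str
    attributes: list      # list of Attribute
    threshold: float = 0.5  # fraction of attributes that must match

def match_form(domain, form_field_labels):
    """Return (is_relevant, bindings) for a form's visible field labels."""
    bindings = {}
    labels = [label.strip().lower() for label in form_field_labels]
    for attr in domain.attributes:
        for label in labels:
            if label in attr.aliases:
                bindings[attr.name] = label  # bind attribute to field
                break
    score = len(bindings) / len(domain.attributes)
    return score >= domain.threshold, bindings

# Example: a hypothetical book-search task.
books = DomainDefinition(
    name="books",
    attributes=[
        Attribute("title", {"title", "book title"}),
        Attribute("author", {"author", "written by"}),
        Attribute("isbn", {"isbn"}),
    ],
)

relevant, bound = match_form(books, ["Book Title", "Author", "Publisher"])
# Two of three attributes bind (score 0.67 >= 0.5), so the form is relevant.
```

A crawler built along these lines would then fill the bound fields with task-specific query values and submit the form, harvesting the result pages it gets back.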