Crawling the content hidden behind web forms

  • Authors:
  • Manuel Álvarez; Juan Raposo; Alberto Pan; Fidel Cacheda; Fernando Bellas; Víctor Carneiro

  • Affiliations:
  • Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain (all authors)

  • Venue:
  • ICCSA'07: Proceedings of the 2007 International Conference on Computational Science and Its Applications - Volume Part II
  • Year:
  • 2007


Abstract

Today's crawler engines cannot reach most of the information contained in the Web. A great amount of valuable information is "hidden" behind the query forms of online databases, or is dynamically generated by technologies such as JavaScript. This portion of the web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access such content. DeepBot receives as input a set of domain definitions, each one describing a specific data-collecting task, and automatically identifies and learns to execute queries on the forms relevant to them. In this paper we describe the techniques employed to build DeepBot and report the experimental results obtained when testing it on several real-world data collection tasks.
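To make the idea of "domain definitions" concrete, the following is a minimal sketch of how a crawler might decide that a web form is relevant to a data-collecting task. All names, the alias-based matching, and the scoring threshold are illustrative assumptions for this sketch, not the actual algorithm used by DeepBot.

```python
# Hedged sketch: matching a web form against a domain definition.
# The domain definition lists the attributes a task cares about, each
# with textual aliases that might label a form field; a form is deemed
# relevant when enough attributes can be bound to its fields.
from dataclasses import dataclass

@dataclass
class Attribute:
    name: str
    aliases: set          # lowercase label variants for this attribute

@dataclass
class DomainDefinition:
    name: str
    attributes: list      # list of Attribute
    threshold: float = 0.5  # fraction of attributes that must match

def match_form(domain, form_field_labels):
    """Return (is_relevant, bindings) for a form's visible field labels."""
    bindings = {}
    labels = [label.strip().lower() for label in form_field_labels]
    for attr in domain.attributes:
        for label in labels:
            if label in attr.aliases:
                bindings[attr.name] = label  # bind attribute to field
                break
    score = len(bindings) / len(domain.attributes)
    return score >= domain.threshold, bindings

# Example: a hypothetical book-search task.
books = DomainDefinition(
    name="books",
    attributes=[
        Attribute("title", {"title", "book title"}),
        Attribute("author", {"author", "written by"}),
        Attribute("isbn", {"isbn"}),
    ],
)

relevant, bound = match_form(books, ["Book Title", "Author", "Publisher"])
# Two of three attributes bind (score 0.67 >= 0.5), so the form is relevant.
```

A crawler built along these lines would then fill the bound fields with task-specific query values and submit the form, harvesting the result pages it gets back.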