SmartCrawl: a new strategy for the exploration of the hidden web

Authors:
Augusto de Carvalho Fontes; Fábio Soares Silva
Affiliations:
Universidade Tiradentes; Universidade Tiradentes
Venue:
Proceedings of the 6th annual ACM international workshop on Web information and data management
Year:
2004

Abstract

The way current search engines work leaves a large amount of the information available on the World Wide Web outside their catalogues. This is because crawlers work by following hyperlinks and a few other references, and they ignore HTML forms. In this paper, we propose a search engine prototype that can retrieve information behind HTML forms by automatically generating queries for them. We describe the architecture, some implementation details, and an experiment showing that this information is in fact not indexed by current search engines.
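
The abstract does not spell out how queries are generated, so the following is only a minimal illustrative sketch of the general idea of filling HTML forms automatically, not the SmartCrawl implementation. It assumes GET forms, a caller-supplied keyword list for text inputs, and exhaustive enumeration of select options; the names FormParser and generate_queries and the example.org URL are hypothetical.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collects each <form> with its action, method, and fillable fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form currently being parsed, if any
        self._select = None   # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "text_fields": [],
                "select_fields": {},
            }
        elif self._current is None:
            return  # ignore fields that are not inside a form
        elif tag == "input" and attrs.get("name"):
            # An <input> without a type attribute defaults to a text field.
            if (attrs.get("type") or "text").lower() == "text":
                self._current["text_fields"].append(attrs["name"])
        elif tag == "select" and attrs.get("name"):
            self._select = attrs["name"]
            self._current["select_fields"][self._select] = []
        elif tag == "option" and self._select and attrs.get("value"):
            self._current["select_fields"][self._select].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def generate_queries(page_url, form, keywords):
    """Yield one GET URL per combination of keywords and select options."""
    if form["method"] != "get":
        return  # assumption: POST forms are out of scope for this sketch
    select_names = list(form["select_fields"])
    names = form["text_fields"] + select_names
    if not names:
        return  # nothing to fill in
    # Text fields draw from the keyword list; selects from their own options.
    domains = [keywords] * len(form["text_fields"])
    domains += [form["select_fields"][name] for name in select_names]
    target = urljoin(page_url, form["action"])
    for values in product(*domains):
        yield target + "?" + urlencode(dict(zip(names, values)))


if __name__ == "__main__":
    page = (
        '<form action="/find" method="get">'
        '<input type="text" name="q">'
        '<select name="lang">'
        '<option value="en">English</option>'
        '<option value="pt">Portuguese</option>'
        "</select></form>"
    )
    parser = FormParser()
    parser.feed(page)
    for url in generate_queries(
        "http://example.org/catalog/index.html", parser.forms[0], ["crawler", "hidden web"]
    ):
        print(url)
```

Each generated URL addresses a result page that a crawler following only hyperlinks would never request. Enumerating every combination grows quickly with the number of fields, so a practical crawler would presumably need some way to prune or prioritise the candidate queries.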