A novel design of hidden web crawler using reinforcement learning based agents

Authors:
J. Akilandeswari; N. P. Gopalan
Affiliations:
Department of Computer Science and Engineering, Sona College of Technology, Salem, Tamil Nadu, India; Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India
Venue:
APPT'07 Proceedings of the 7th international conference on Advanced parallel processing technologies
Year:
2007

Abstract

An ever-increasing amount of information on the Web is available only through search interfaces: users must type keywords into a search form to access pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Because no static links point to Hidden Web pages, search engines cannot discover and index them and therefore do not return them in results. According to recent studies, however, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users. This paper presents ALAC, a Hidden Web crawler design that autonomously discovers pages from the Hidden Web. A theoretical framework is presented to investigate the resource discovery problem, and an effective crawling strategy is proposed for identifying Hidden Web sites automatically. The crawler design employs agents driven by reinforcement learning. A prototype is evaluated experimentally, and the results are very promising: ALAC found 567 searchable forms after searching 3450 pages, which substantiates the effectiveness of the policy.
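
The abstract does not specify ALAC's learning algorithm, so the following is only a minimal illustrative sketch of how a reinforcement-learning agent might steer a crawler toward searchable forms. Everything in it is an assumption made for illustration: the epsilon-greedy policy, the per-word value table, the names `LinkAgent`, `has_searchable_form`, and `harvest_rate`, and the crude form heuristic. None of it is taken from the paper.

```python
import random
import re

# Matches a whole <form>...</form> element (crude; real HTML needs a parser).
FORM_RE = re.compile(r"<form\b.*?</form>", re.IGNORECASE | re.DOTALL)


def has_searchable_form(html: str) -> bool:
    """Hypothetical heuristic: a form with a text input and no password field."""
    for form in FORM_RE.findall(html):
        lowered = form.lower()
        if 'type="text"' in lowered and 'type="password"' not in lowered:
            return True
    return False


class LinkAgent:
    """Epsilon-greedy agent that learns which anchor-text words lead to forms.

    This is an assumed, simplified stand-in for the paper's agents.
    """

    def __init__(self, epsilon: float = 0.1, alpha: float = 0.5, seed: int = 0):
        self.q = {}                      # estimated value of each anchor word
        self.epsilon = epsilon           # probability of exploring a random link
        self.alpha = alpha               # learning rate for value updates
        self.rng = random.Random(seed)

    def score(self, anchor: str) -> float:
        """Score a link by the best-valued word in its anchor text."""
        return max((self.q.get(w, 0.0) for w in anchor.lower().split()),
                   default=0.0)

    def choose(self, links):
        """Pick one (anchor_text, url) pair: explore at random or exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(links)
        return max(links, key=lambda link: self.score(link[0]))

    def update(self, anchor: str, reward: float) -> None:
        """Move each anchor word's value toward the observed reward."""
        for w in anchor.lower().split():
            old = self.q.get(w, 0.0)
            self.q[w] = old + self.alpha * (reward - old)


def harvest_rate(forms_found: int, pages_visited: int) -> float:
    """Fraction of visited pages that yielded a searchable form."""
    return forms_found / pages_visited
```

In a crawl loop under these assumptions, the agent would call `choose` on a page's outgoing links, fetch the chosen page, and call `update` with reward 1.0 if `has_searchable_form` returns true and 0.0 otherwise. Applied to the figures reported in the abstract, `harvest_rate(567, 3450)` is roughly 0.164, that is, about one searchable form for every six pages visited.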