Efficient deep web crawling using reinforcement learning

Authors:
Lu Jiang;Zhaohui Wu;Qian Feng;Jun Liu;Qinghua Zheng
Affiliations:
MOE KLINNS Lab and SKLMS Lab, Xi'an Jiaotong University, Xi'an, P.R.China;MOE KLINNS Lab and SKLMS Lab, Xi'an Jiaotong University, Xi'an, P.R.China;MOE KLINNS Lab and SKLMS Lab, Xi'an Jiaotong University, Xi'an, P.R.China;MOE KLINNS Lab and SKLMS Lab, Xi'an Jiaotong University, Xi'an, P.R.China;MOE KLINNS Lab and SKLMS Lab, Xi'an Jiaotong University, Xi'an, P.R.China
Venue:
PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
Year:
2010

Abstract

The deep web is the hidden part of the Web that standard Web crawlers cannot reach. Obtaining deep web content is challenging, and this has been acknowledged as a significant gap in the coverage of search engines. To address this gap, the paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and the deep web database as the environment. The agent perceives its current state and selects an action (a query) to submit to the environment according to its Q-value. The framework not only enables crawlers to learn a promising crawling strategy from their own experience, but also allows them to utilize diverse features of query keywords. Experimental results show that the method outperforms state-of-the-art methods in crawling capability and overcomes the assumption of full-text search implied by existing methods.
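
The agent/environment loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the toy database, the state definition (the set of keywords issued so far), the reward (the fraction of records newly retrieved), and all names and parameters (RECORDS, submit_query, crawl_episode, train, alpha, gamma, epsilon) are assumptions made for illustration, and the update is a standard tabular Q-learning rule rather than the method evaluated in the paper.

```python
import random

# Hypothetical toy "deep web database": each record is a set of keywords.
RECORDS = [
    {"python", "web"},
    {"python", "crawler"},
    {"crawler", "web"},
    {"java", "web"},
    {"java", "database"},
    {"database", "index"},
]
KEYWORDS = sorted(set().union(*RECORDS))


def submit_query(keyword):
    """Environment: return the ids of records that match a keyword query."""
    return {i for i, rec in enumerate(RECORDS) if keyword in rec}


def crawl_episode(q, budget, epsilon, alpha=0.5, gamma=0.9, rng=random):
    """Issue up to `budget` queries, updating the Q-table `q` in place."""
    state = frozenset()   # assumed state: keywords issued so far
    harvested = set()     # ids of records retrieved so far
    for _ in range(budget):
        actions = [k for k in KEYWORDS if k not in state]
        if not actions:
            break
        # Epsilon-greedy selection: explore at random, else take the best Q-value.
        if rng.random() < epsilon:
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: q.get((state, a), 0.0))
        new_ids = submit_query(action) - harvested
        reward = len(new_ids) / len(RECORDS)   # assumed reward: share of new records
        harvested |= new_ids
        next_state = state | {action}
        next_actions = [k for k in KEYWORDS if k not in next_state]
        best_next = max(
            (q.get((next_state, a), 0.0) for a in next_actions), default=0.0
        )
        # Standard tabular Q-learning update.
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
        state = next_state
    return harvested


def train(episodes=500, budget=3, seed=0):
    """Learn Q-values over repeated crawls of the toy database."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        crawl_episode(q, budget, epsilon=0.2, rng=rng)
    return q
```

After calling q = train(), a greedy crawl can be run with crawl_episode(q, budget=3, epsilon=0.0); it returns the ids of the records harvested under the learned Q-values. A real crawler would not know the database contents in advance and would have to estimate rewards from the result pages it receives.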