Learning to crawl deep web

Authors:
Qinghua Zheng;Zhaohui Wu;Xiaocheng Cheng;Lu Jiang;Jun Liu
Affiliations:
MOE KLINNS Lab and SKLMS Lab, Xi'an Jiaotong University, No. 28, Xianning West Road, Xi'an 710049, China (all authors)
Venue:
Information Systems
Year:
2013

Abstract

The deep web, or hidden web, is the part of the Web (usually residing in structured databases) that standard Web crawlers cannot reach. Obtaining deep web content is challenging and has been acknowledged as a significant gap in the coverage of search engines. This paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and the deep web database as the environment. The agent perceives its current state and, according to the Q-value, selects an action (a query) to submit to the environment. Existing methods assume that all deep web databases possess full-text search interfaces, and they rely solely on statistics of the acquired data records (term frequency, TF, or document frequency, DF) to generate the next query. In contrast, the reinforcement learning framework not only enables the crawler to learn a promising crawling strategy from its own experience, but also allows it to use diverse features of query keywords. Experimental results show that the method outperforms state-of-the-art methods in crawling capability and relaxes the full-text search assumption implied by existing methods.
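
The agent/environment loop described in the abstract can be sketched roughly as follows. This is a minimal toy under stated assumptions, not the authors' implementation: the function name `crawl`, the dictionary-based `database`, the reward (fraction of previously unseen records a query returns), the epsilon-greedy selection rule, and the way Q-values of newly discovered keywords are seeded and smoothed are all hypothetical choices made for illustration. The abstract does not specify the paper's reward function, state representation, or keyword features.

```python
import random


def crawl(database, seed, max_queries=20, epsilon=0.1, alpha=0.5, rng=None):
    """Crawl a toy keyword-searchable database with Q-value-guided queries.

    database: dict mapping record id -> set of keywords. This is a
    hypothetical stand-in for a deep web database behind a search form.
    Returns (set of harvested record ids, list of issued queries in order).
    """
    rng = rng or random.Random(0)
    harvested = set()        # agent state: records obtained so far
    q_values = {seed: 0.0}   # estimated value of each candidate query
    issued = []              # queries already submitted

    for _ in range(max_queries):
        candidates = [k for k in q_values if k not in issued]
        if not candidates:
            break

        # Action selection: explore with probability epsilon,
        # otherwise pick the candidate with the highest Q-value.
        if rng.random() < epsilon:
            query = rng.choice(candidates)
        else:
            query = max(candidates, key=q_values.get)
        issued.append(query)

        # Environment step: the database returns the matching records.
        matched = {rid for rid, words in database.items() if query in words}
        new_records = matched - harvested
        harvested |= new_records

        # Assumed reward: fraction of the database newly harvested.
        reward = len(new_records) / max(len(database), 1)

        # Assumed update: keywords seen in new records become candidate
        # queries; their Q-values move toward the reward just observed.
        for rid in sorted(new_records):
            for word in sorted(database[rid]):
                if word in issued:
                    continue
                if word not in q_values:
                    q_values[word] = reward
                else:
                    q_values[word] += alpha * (reward - q_values[word])

    return harvested, issued
```

Calling `crawl(db, "a", epsilon=0.0)` on a small dictionary `db` of keyword sets runs the loop greedily. In this sketch the crawler only ever queries keywords it has already seen in returned records, so it needs a seed keyword and can stall on records that share no keyword with the harvested set.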