Learning Deep Web Crawling with Diverse Features

Authors:
Lu Jiang;Zhaohui Wu;Qinghua Zheng;Jun Liu
Venue:
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2009

Abstract

The key to Deep Web crawling is to submit promising keywords to a query form and retrieve Deep Web content efficiently. To select keywords, existing methods rely on statistical information about each keyword, derived from its term frequency (TF) and document frequency (DF) in the locally acquired records. They therefore work well only on textual databases that provide full-text search interfaces, and poorly on structured databases with multi-attribute or field-restricted search interfaces. This paper proposes a novel Deep Web crawling method. Each keyword is encoded as a tuple of its linguistic, statistical, and HTML features, so that a harvest rate evaluation model can be learned from the keywords already issued and applied to those not yet issued. The method thus removes the plain-text search assumption made by existing methods. Experimental results show that the method outperforms state-of-the-art methods.
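
The idea of encoding keywords as feature tuples and learning a harvest rate model from issued keywords can be illustrated with a minimal sketch. The abstract does not specify the features or the learner, so everything concrete here is an assumption made for illustration: the four features (keyword length as a linguistic feature, normalized TF and DF as statistical features, and a flag for whether the keyword occurs in a form field label as an HTML feature), the use of k-nearest-neighbor regression, and the hypothetical function names `encode_keyword`, `predict_harvest_rate`, and `select_next`. It is not the authors' implementation.

```python
import math
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def encode_keyword(keyword: str, records: Sequence[str],
                   html_fields: Sequence[str]) -> Vector:
    """Encode a keyword as a feature tuple (all features are assumptions).

    Returns (length, tf, df, in_field):
      length   - linguistic feature: number of characters
      tf, df   - statistical features over locally acquired records,
                 each divided by the number of records
      in_field - HTML feature: 1.0 if the keyword occurs in a form field label
    """
    kw = keyword.lower()
    tokenized = [r.lower().split() for r in records]
    n = max(len(records), 1)  # avoid division by zero on an empty sample
    tf = sum(tokens.count(kw) for tokens in tokenized)
    df = sum(1 for tokens in tokenized if kw in tokens)
    in_field = 1.0 if any(kw in f.lower().split() for f in html_fields) else 0.0
    return (float(len(kw)), tf / n, df / n, in_field)


def predict_harvest_rate(x: Vector,
                         issued: Sequence[Tuple[Vector, float]],
                         k: int = 2) -> float:
    """Estimate the harvest rate of an un-issued keyword.

    k-NN regression (a stand-in learner): average the observed harvest
    rates of the k issued keywords closest to x in feature space.
    """
    if not issued:
        raise ValueError("at least one issued keyword is required")
    nearest = sorted(issued, key=lambda item: math.dist(x, item[0]))[:k]
    return sum(rate for _, rate in nearest) / len(nearest)


def select_next(candidates: Sequence[str],
                issued: Sequence[Tuple[Vector, float]],
                records: Sequence[str],
                html_fields: Sequence[str],
                k: int = 2) -> str:
    """Pick the candidate keyword with the highest predicted harvest rate."""
    return max(
        candidates,
        key=lambda kw: predict_harvest_rate(
            encode_keyword(kw, records, html_fields), issued, k),
    )
```

In a crawl loop, each issued keyword and its observed harvest rate would be appended to `issued` after the query returns, so the estimate for the remaining candidates improves as the crawl proceeds. Because the features are not rescaled, the length feature dominates the distance in this sketch; a real system would likely normalize them.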