Deep Web adaptive crawling based on minimum executable pattern

  • Authors:
  • Jun Liu;Lu Jiang;Zhaohui Wu;Qinghua Zheng

  • Affiliations:
  • MOE KLINNS Lab and SKLMS Lab, Xi'an Jiaotong University, Xi'an, People's Republic of China 710049 (all four authors)

  • Venue:
  • Journal of Intelligent Information Systems
  • Year:
  • 2011

Abstract

The key to Deep Web crawling is to submit valid input values to a query form and retrieve Deep Web content efficiently. In the literature, related work focuses only on generic text boxes or on entire query forms, which causes the problem of "data islands" or inferior validity of query submission. This paper proposes the concept of the Minimum Executable Pattern (MEP), a minimal combination of elements in a query form that can conduct a successful query, and then presents a MEP generation method and a MEP-based Deep Web adaptive crawling method. The query form is parsed and partitioned into a MEP set, and then locally optimal queries are generated by choosing a MEP in the MEP set and a keyword vector of that MEP. Furthermore, the crawler can decide when to terminate, balancing high coverage of the content against resource consumption. The adoption of MEPs is expected to improve the validity of query submission, and adaptive selection among multiple MEPs is effective in overcoming the problem of "data islands". We present a set of experiments to validate the effectiveness of the proposed method. Experimental results show that our method outperforms state-of-the-art methods in terms of query capability and applicability, and that, on average, it achieves good coverage by issuing only a few hundred queries.
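
The crawling loop described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the record store, the MEP set, the scoring heuristic (how often a keyword vector occurs in already harvested records, with ties going to the MEP with fewer fields), and the stopping rule (`patience` consecutive queries with no new records) are all assumptions made for illustration. The names `RECORDS`, `MEPS`, `submit`, `next_query`, and `crawl` are hypothetical.

```python
from itertools import product

# Hypothetical record store standing in for a Deep Web database.
RECORDS = [
    {"id": 1, "title": "web crawling", "author": "liu", "year": "2010"},
    {"id": 2, "title": "deep web", "author": "jiang", "year": "2011"},
    {"id": 3, "title": "query forms", "author": "liu", "year": "2011"},
    {"id": 4, "title": "data mining", "author": "wu", "year": "2011"},
]

# Hypothetical MEP set: each MEP is a tuple of form fields that, filled in
# together, is assumed to be enough for a successful submission.
MEPS = [("author",), ("year",), ("author", "year")]


def submit(mep, values):
    """Simulate submitting one MEP with one keyword vector; return matching ids."""
    return {
        r["id"]
        for r in RECORDS
        if all(r[field] == value for field, value in zip(mep, values))
    }


def next_query(meps, harvested, issued):
    """Pick an unissued (MEP, keyword vector) pair using an assumed heuristic."""
    seen = list(harvested.values())
    best, best_key = None, None
    for mep in meps:
        # Candidate keywords come only from records retrieved so far.
        domains = [sorted({r[field] for r in seen}) for field in mep]
        for vec in product(*domains):
            if (mep, vec) in issued:
                continue
            hits = sum(all(r[f] == v for f, v in zip(mep, vec)) for r in seen)
            key = (hits, -len(mep))  # frequent vectors first, then smaller MEPs
            if best_key is None or key > best_key:
                best, best_key = (mep, vec), key
    return best


def crawl(seed, meps, max_queries=20, patience=2):
    """Issue queries until the budget is spent or results stop being new."""
    by_id = {r["id"]: r for r in RECORDS}
    harvested, issued, stale = {}, set(), 0
    query = seed
    while query is not None and stale < patience and len(issued) < max_queries:
        issued.add(query)
        new_ids = submit(*query) - set(harvested)
        harvested.update((i, by_id[i]) for i in new_ids)
        stale = 0 if new_ids else stale + 1
        query = next_query(meps, harvested, issued)
    return harvested


seed = (("author",), ("liu",))
all_meps = crawl(seed, MEPS)
author_only = crawl(seed, [("author",)])
print(sorted(all_meps), sorted(author_only))
```

In this toy setting, a crawler restricted to the `author` field never leaves the records written by the seed author, while adding the `year` MEP lets it reach the remaining records. This mirrors, in miniature, the "data islands" issue the abstract mentions.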