Crawling Deep Web Using a New Set Covering Algorithm

  • Authors:
  • Yan Wang;Jianguo Lu;Jessica Chen

  • Affiliations:
  • School of Computer Science, University of Windsor, Windsor, Canada N9B 3P4;School of Computer Science, University of Windsor, Windsor, Canada N9B 3P4 and Key Lab of Novel Software Technology, Nanjing, China;School of Computer Science, University of Windsor, Windsor, Canada N9B 3P4

  • Venue:
  • ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
  • Year:
  • 2009

Abstract

Crawling the deep web often requires selecting an appropriate set of queries that together cover most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been studied extensively. Conventional set covering algorithms, however, do not work well when applied to deep web crawling because of several special features of this application domain. Typically, most set covering algorithms assume a uniform distribution of the elements being covered, whereas in deep web crawling neither the sizes of the documents nor the document frequencies of the queries are distributed uniformly. Instead, they follow a power-law distribution. Hence, we have developed a new set covering algorithm that targets deep web crawling. Compared to our previous deep web crawling method, which uses a straightforward greedy set covering algorithm, it introduces weights into the greedy strategy. Our experiments, carried out on a variety of corpora, show that this new method consistently outperforms its unweighted version.
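
To make the difference between a plain and a weighted greedy strategy concrete, here is a minimal Python sketch. The abstract does not state how the weights are defined, so the scheme used here (a document matched by fewer candidate queries counts for more) is purely an illustrative assumption, as are the function name `greedy_cover`, the use of a query's document frequency as its cost, and the toy data. It should not be read as the authors' algorithm.

```python
from collections import defaultdict


def greedy_cover(queries, weighted=False):
    """Pick queries until every document is covered.

    queries: dict mapping a query term to the set of document ids it matches.
    Cost of a query is assumed to be its document frequency, len(queries[q]).
    Returns (chosen query list, total cost).
    """
    universe = set().union(*queries.values())

    if weighted:
        # ASSUMPTION (not from the abstract): a document's weight is the
        # inverse of the number of candidate queries that match it, so
        # hard-to-reach documents contribute more to a query's gain.
        degree = defaultdict(int)
        for docs in queries.values():
            for d in docs:
                degree[d] += 1
        weight = {d: 1.0 / degree[d] for d in universe}
    else:
        # Unweighted greedy: every uncovered document counts equally.
        weight = {d: 1.0 for d in universe}

    uncovered = set(universe)
    chosen, total_cost = [], 0
    while uncovered:
        def gain_per_cost(q):
            new_docs = queries[q] & uncovered
            return sum(weight[d] for d in new_docs) / len(queries[q])

        # Only consider queries that still add at least one new document.
        candidates = (q for q in queries if queries[q] & uncovered)
        best = max(candidates, key=gain_per_cost)
        chosen.append(best)
        total_cost += len(queries[best])
        uncovered -= queries[best]
    return chosen, total_cost


if __name__ == "__main__":
    # Hypothetical toy corpus: query term -> matching document ids.
    toy = {"a": {1, 2, 3, 4}, "b": {1, 2}, "c": {3, 4}, "d": {5}, "e": {4, 5}}
    print(greedy_cover(toy, weighted=False))
    print(greedy_cover(toy, weighted=True))
```

Both modes share the same loop and differ only in how the gain of a candidate query is scored, which mirrors the abstract's description of adding weights to an otherwise greedy strategy. Under a skewed, power-law-like distribution, the two scoring rules can select different queries and incur different total costs.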