Exploiting Multiple Features with MEMMs for Focused Web Crawling

  • Authors:
  • Hongyu Liu;Evangelos Milios;Larry Korba

  • Affiliations:
  • National Research Council Institute for Information Technology, Canada; Faculty of Computer Science, Dalhousie University, Canada (listed for all three authors)

  • Venue:
  • NLDB '08 Proceedings of the 13th international conference on Natural Language and Information Systems: Applications of Natural Language to Information Systems
  • Year:
  • 2008

Abstract

Focused web crawling traverses the Web to collect documents on a specific topic. This is not an easy task, since a focused crawler must identify the next most promising link to follow based on the topic and on the content and links of previously crawled pages. In this paper, we present a framework based on Maximum Entropy Markov Models (MEMMs) for an enhanced focused web crawler that takes advantage of richer representations of multiple features extracted from Web pages, such as anchor text and the keywords embedded in the link URL, to represent useful context. The key idea of our approach is to treat focused web crawling as a sequential task and to use a combination of content analysis and link structure to capture sequential patterns leading to targets. Experimental results on Web data show that the MEMM-based focused crawler is generally very competitive against Best-First crawling in terms of two metrics: Precision and Maximum Average Similarity.
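
As a rough illustration of the idea in the abstract, the sketch below computes an MEMM-style transition distribution for one link, using features drawn from the anchor text and the URL. It is a minimal sketch under stated assumptions, not the authors' implementation: the state labels ("target", "near", "far"), the feature names, the weights, and the helper functions `link_features` and `transition_probs` are all hypothetical, and the abstract does not specify how states or weights are actually defined or trained.

```python
import math
import re


def link_features(anchor_text, url, topic_terms):
    """Binary features marking topic terms found in the anchor text or URL.

    The feature naming scheme ("anchor:<term>", "url:<term>") is an assumption.
    """
    anchor_tokens = set(re.findall(r"[a-z0-9]+", anchor_text.lower()))
    url_tokens = set(re.findall(r"[a-z0-9]+", url.lower()))
    feats = {}
    for term in topic_terms:
        if term in anchor_tokens:
            feats["anchor:" + term] = 1.0
        if term in url_tokens:
            feats["url:" + term] = 1.0
    return feats


def transition_probs(prev_state, feats, weights, states):
    """P(next_state | prev_state, observation) as a maximum-entropy (softmax) model.

    weights maps (prev_state, next_state, feature_name) to a float;
    missing entries count as 0.0.
    """
    scores = {
        s: sum(weights.get((prev_state, s, name), 0.0) * value
               for name, value in feats.items())
        for s in states
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {s: math.exp(v - top) for s, v in scores.items()}
    z = sum(exp_scores.values())
    return {s: e / z for s, e in exp_scores.items()}


# Hypothetical states and hand-set weights, for illustration only.
states = ["target", "near", "far"]
weights = {
    ("near", "target", "anchor:crawler"): 2.0,
    ("near", "target", "url:crawler"): 1.0,
}
feats = link_features(
    "Focused crawler tutorial", "http://example.org/crawler/intro", ["crawler"]
)
probs = transition_probs("near", feats, weights, states)
```

A crawler built along these lines could rank the links in its frontier by the probability assigned to the "target" state, whereas a Best-First baseline would rank them by a single content-similarity score.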