The architecture and implementation of an extensible web crawler

  • Authors:
  • Jonathan M. Hsieh; Steven D. Gribble; Henry M. Levy

  • Affiliations:
  • Department of Computer Science & Engineering, University of Washington, Seattle, WA (all authors)

  • Venue:
  • NSDI '10: Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation
  • Year:
  • 2010

Abstract

Many Web services operate their own Web crawlers to discover data of interest, despite the fact that large-scale, timely crawling is complex, operationally intensive, and expensive. In this paper, we introduce the extensible crawler, a service that crawls the Web on behalf of its many client applications. Clients inject filters into the extensible crawler; the crawler evaluates all received filters against each Web page, notifying clients of matches. As a result, the act of crawling the Web is decoupled from determining whether a page is of interest, shielding client applications from the burden of crawling the Web themselves. This paper describes the architecture, implementation, and evaluation of our prototype extensible crawler, and also relates early experience from several crawler applications we have built. We focus on the challenges and trade-offs in the system, such as the design of a filter language that is simultaneously expressive and efficient to execute, the use of filter indexing to cheaply match a page against millions of filters, and the use of document and filter partitioning to scale our prototype implementation to high document throughput and large numbers of filters. We argue that the low-latency, highly selective, and scalable nature of our system makes it a promising platform for taking advantage of emerging real-time streams of data, such as Facebook or Twitter feeds.
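
To make the filter-indexing idea in the abstract concrete, the sketch below shows one simple way an inverted index from terms to filter IDs can match a page against many injected filters without scanning every filter. This is a minimal illustration under stated assumptions, not the paper's implementation: the conjunctive-keyword filter form, the names (inject_filter, process_page), and the single-process design are all invented here, and the paper's actual filter language is more expressive and its engine far more scalable.

    from collections import defaultdict

    class ExtensibleCrawlerSketch:
        """Toy sketch of filter injection and index-based matching.
        Clients inject filters; an inverted index from terms to
        filter IDs lets each crawled page be checked against only
        the filters that share a term with it."""

        def __init__(self):
            self.filters = {}              # filter_id -> (client, required term set)
            self.index = defaultdict(set)  # term -> IDs of filters mentioning it
            self.next_id = 0

        def inject_filter(self, client, terms):
            """Register a conjunctive keyword filter on behalf of a client."""
            fid = self.next_id
            self.next_id += 1
            required = set(t.lower() for t in terms)
            self.filters[fid] = (client, required)
            for term in required:
                self.index[term].add(fid)  # index each term for cheap candidate lookup
            return fid

        def process_page(self, url, text):
            """Evaluate one crawled page against all injected filters."""
            page_terms = set(text.lower().split())
            candidates = set()
            for term in page_terms:
                # Only filters sharing at least one term can possibly match.
                candidates |= self.index.get(term, set())
            matches = []
            for fid in candidates:
                client, required = self.filters[fid]
                if required <= page_terms:  # every filter term appears on the page
                    matches.append((client, fid))
            return matches

    crawler = ExtensibleCrawlerSketch()
    crawler.inject_filter("alice", ["web", "crawler"])
    crawler.inject_filter("bob", ["twitter", "feeds"])
    print(crawler.process_page("http://example.com",
                               "an extensible web crawler for real-time feeds"))
    # -> [('alice', 0)] : only alice's filter has all its terms on the page

The sketch also ignores the partitioning dimension the abstract mentions: in the described system, documents and filters are additionally partitioned across machines to scale to high document throughput and large filter populations.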