Efficient URL caching for world wide web crawling

Authors:
Andrei Z. Broder;Marc Najork;Janet L. Wiener
Affiliations:
IBM TJ Watson Research Center, Hawthorne, NY;Microsoft Research, Mountain View, CA;Hewlett Packard Labs, Palo Alto, CA
Venue:
WWW '03 Proceedings of the 12th international conference on World Wide Web
Year:
2003

Citing 22
Cited 17

An analysis of optimum caching

Journal of Algorithms
Handbook of algorithms and data structures: in Pascal and C (2nd ed.)

Handbook of algorithms and data structures: in Pascal and C (2nd ed.)
Operating systems (2nd ed.): design and implementation

Operating systems (2nd ed.): design and implementation
Computer architecture (2nd ed.): a quantitative approach

Computer architecture (2nd ed.): a quantitative approach
The art of computer programming, volume 3: (2nd ed.) sorting and searching

The art of computer programming, volume 3: (2nd ed.) sorting and searching
Online computation and competitive analysis

Online computation and competitive analysis
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient crawling through URL ordering

WWW7 Proceedings of the seventh international conference on World Wide Web 7
A technique for measuring the relative size and overlap of public Web search engines

WWW7 Proceedings of the seventh international conference on World Wide Web 7
The space complexity of approximating the frequency moments

Journal of Computer and System Sciences
Synopsis data structures for massive data sets

Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms
Graph structure in the Web

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
An adaptive model for optimizing performance of an incremental web crawler

Proceedings of the 10th international conference on World Wide Web
Breadth-first crawling yields high-quality pages

Proceedings of the 10th international conference on World Wide Web
Parallel crawlers

Proceedings of the 11th international conference on World Wide Web
Algorithms in C

Algorithms in C
Mercator: A scalable, extensible Web crawler

World Wide Web
The Evolution of the Web and Implications for an Incremental Crawler

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Finding Frequent Items in Data Streams

ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
A large-scale study of the evolution of web pages

WWW '03 Proceedings of the 12th international conference on World Wide Web
The Link Database: Fast Access to Graphs of the Web

DCC '02 Proceedings of the Data Compression Conference
Design and Implementation of a High-Performance Distributed Web Crawler

ICDE '02 Proceedings of the 18th International Conference on Data Engineering

Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
XPath lookup queries in P2P networks

Proceedings of the 6th annual ACM international workshop on Web information and data management
Approximately detecting duplicates for streaming data using stable bloom filters

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Automated gathering of Web information: An in-depth examination of agents interacting with search engines

ACM Transactions on Internet Technology (TOIT)
Storing and retrieving XPath fragments in structured P2P networks

Data & Knowledge Engineering - Special issue: WIDM 2004
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
Efficient Monitoring Algorithm for Fast News Alerts

IEEE Transactions on Knowledge and Data Engineering
The Viúva Negra crawler: an experience report

Software—Practice & Experience
Modelling and simulation of the web graph: evaluating an exponential growth copying model

International Journal of Web Engineering and Technology
Finding duplicates in a data stream

SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
IRLbot: Scaling to 6 billion pages and beyond

ACM Transactions on the Web (TWEB)
Improved approximate detection of duplicates for data streams over sliding windows

Journal of Computer Science and Technology
Web Crawling

Foundations and Trends in Information Retrieval
Fixing the threshold for effective detection of near duplicate web documents in web crawling

ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Incremental web-site boundary detection using random walks

MLDM'11 Proceedings of the 7th international conference on Machine learning and data mining in pattern recognition
Duplicate detection in pay-per-click streams using temporal stateful Bloom filters

International Journal of Data Analysis Techniques and Strategies
Adscape: harvesting and analyzing online display ads

Proceedings of the 23rd international conference on World wide web

Quantified Score

Hi-index	0.00

Visualization

Abstract

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests.Our main conclusion is that caching is very effective - in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.