Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test in step (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and an infinite cache). We performed about 1,800 simulations of these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests.

Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
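The membership test described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it pairs an in-memory LRU cache (one of the practical policies studied) with a plain Python set standing in for the crawler's much larger, slower on-disk seen-URL structure. The class and method names are invented for this example.

```python
from collections import OrderedDict

class LRUSeenCache:
    """Hypothetical sketch: an in-memory LRU cache in front of the
    authoritative seen-URL store (here a plain set standing in for
    the on-disk structure a real crawler would use)."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.cache = OrderedDict()   # url -> None, ordered by recency
        self.store = backing_store   # authoritative (slow) seen-URL set
        self.hits = 0                # resolved in memory
        self.misses = 0              # had to consult the backing store

    def seen_before(self, url):
        """Return True if url was seen before; record it either way."""
        if url in self.cache:
            self.cache.move_to_end(url)      # refresh recency on a hit
            self.hits += 1
            return True
        # Cache miss: consult the slow authoritative store.
        self.misses += 1
        was_seen = url in self.store
        self.store.add(url)
        self.cache[url] = None
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used
        return was_seen


# Usage: step (c) of the crawl loop filters URLs through the cache.
seen = LRUSeenCache(capacity=50_000, backing_store=set())
extracted = ["http://a.example/", "http://b.example/", "http://a.example/"]
frontier = [u for u in extracted if not seen.seen_before(u)]
```

A hit rate near 80%, as reported in the paper, means roughly four out of five membership tests never touch the slow distributed store, which is what makes the thousand-fetches-per-second budget feasible.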