Search engines for storage systems rely on crawlers to gather the list of files that need to be indexed. The recency of an index is determined by how quickly this list can be gathered. While there is a substantial body of literature on building efficient web crawlers, there is very little on file system crawlers. In this paper we discuss the challenges in building a file system crawler. We then present the design of two file system crawlers: the first uses the standard POSIX file system API but carefully bounds the amount of memory and CPU it uses; the second leverages modifications to the file system's internals, and a new API called SnapDiff, to detect modified files rapidly. For both crawlers we describe the incremental differencing design: the method used to produce the list of changes between a previous crawl and the current point in time.
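The abstract itself contains no code, but the incremental differencing idea can be illustrated with a minimal sketch. The version below takes only the portable POSIX route described for the first crawler: it walks the tree, records each file's modification time, and diffs that snapshot against the one saved by the previous crawl to report created, modified, and deleted files. The function names (walk_metadata, diff_snapshots) and the JSON snapshot format are invented for this example, the snapshot is held in memory rather than bounded as the paper's first design requires, and the SnapDiff-based change detection, which needs file system support, is not shown.

```python
import json
import os
import sys


def walk_metadata(root):
    """Crawl the tree via the POSIX API, recording each file's mtime.

    Note: a real crawler would bound its memory use (e.g., by streaming
    results to disk); this sketch keeps the whole map in memory.
    """
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                snapshot[path] = os.lstat(path).st_mtime
            except OSError:
                # The file vanished between listing and stat; skip it.
                continue
    return snapshot


def diff_snapshots(previous, current):
    """Return (created, modified, deleted) paths between two crawls."""
    created = [p for p in current if p not in previous]
    modified = [p for p in current
                if p in previous and current[p] != previous[p]]
    deleted = [p for p in previous if p not in current]
    return created, modified, deleted


if __name__ == "__main__":
    # Usage: python crawl.py <root> <state_file>
    root, state_file = sys.argv[1], sys.argv[2]
    try:
        with open(state_file) as f:
            previous = json.load(f)
    except FileNotFoundError:
        previous = {}  # first crawl: every file is reported as created
    current = walk_metadata(root)
    created, modified, deleted = diff_snapshots(previous, current)
    print(f"{len(created)} created, {len(modified)} modified, "
          f"{len(deleted)} deleted")
    with open(state_file, "w") as f:
        json.dump(current, f)  # becomes the baseline for the next crawl
```

Comparing mtimes against a saved baseline is only one way to realize incremental differencing; it requires a full re-walk on every crawl, which is exactly the cost that a file-system-assisted API such as SnapDiff is meant to avoid.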