Efficient detection of large-scale redundancy in enterprise file systems

Authors:
George Forman;Kave Eshghi;Jaap Suermondt
Affiliations:
Hewlett-Packard Labs, Palo Alto, CA;Hewlett-Packard Labs, Palo Alto, CA;Hewlett-Packard Labs, Palo Alto, CA
Venue:
ACM SIGOPS Operating Systems Review
Year:
2009

Abstract

In order to catch and reduce waste in the exponentially increasing demand for disk storage, we have developed very efficient technology to detect approximate duplication of large directory hierarchies. Such duplication can be caused, for example, by unnecessary mirroring of repositories by uncoordinated employees or departments. Identifying these duplicate or near-duplicate hierarchies allows appropriate action to be taken at a high level. For example, one could coordinate and consolidate multiple copies in one location.
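
The abstract does not describe the detection technique itself, so the following is only an illustrative sketch of the problem statement, not the authors' method. It assumes that a directory can be summarized by the set of content hashes of all files in its subtree, and that two hierarchies count as near-duplicates when the Jaccard similarity of those sets exceeds a threshold. The function names, the `threshold` and `min_files` parameters, and the brute-force pairwise comparison are all assumptions made for illustration; comparing every pair is quadratic in the number of directories and would not scale to an enterprise file system.

```python
import hashlib
import os
from itertools import combinations


def file_digest(path, chunk_size=1 << 20):
    """Return a SHA-1 hex digest of the file's contents, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def subtree_signatures(root):
    """Map each directory under root to the set of content digests
    of every regular file in its subtree (hypothetical helper)."""
    sigs = {}
    # Bottom-up walk, so child signatures exist before their parent needs them.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        digests = set()
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                digests.add(file_digest(full))
        for d in dirnames:
            # Symlinked directories are not walked, so they contribute nothing.
            digests |= sigs.get(os.path.join(dirpath, d), set())
        sigs[dirpath] = digests
    return sigs


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_directories(root, threshold=0.8, min_files=2):
    """Return (score, dir1, dir2) for directory pairs whose subtree
    signatures have Jaccard similarity >= threshold, best first."""
    sigs = {d: s for d, s in subtree_signatures(root).items()
            if len(s) >= min_files}
    pairs = []
    items = sorted(sigs.items(), key=lambda kv: kv[0])
    for (d1, s1), (d2, s2) in combinations(items, 2):
        # A directory trivially overlaps its own ancestors; skip those pairs.
        if os.path.commonpath([d1, d2]) in (d1, d2):
            continue
        score = jaccard(s1, s2)
        if score >= threshold:
            pairs.append((score, d1, d2))
    return sorted(pairs, reverse=True)
```

Calling `similar_directories("/some/share", threshold=0.9)` (the path is a placeholder) would list pairs of directory trees that share at least 90% of their distinct file contents, which is the kind of high-level report the abstract suggests acting on, for example by consolidating the copies in one location.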