Efficient similarity estimation for systems exploiting data redundancy

Authors:
Kanat Tangwongsan;Himabindu Pucha;David G. Andersen;Michael Kaminsky
Affiliations:
Carnegie Mellon University;IBM Research Almaden;Carnegie Mellon University;Intel Labs Pittsburgh
Venue:
INFOCOM'10 Proceedings of the 29th conference on Information communications
Year:
2010

Abstract

Many modern systems exploit data redundancy to improve efficiency. These systems split data into chunks, generate an identifier for each chunk, and compare these identifiers across data items to find duplicate chunks. Chunk size is therefore a critical parameter for the efficiency of these systems: smaller chunks can potentially improve similarity detection, but at the cost of increased overhead to represent more chunks. Unfortunately, the similarity between files increases unpredictably with smaller chunk sizes, even for data of the same type. Existing systems often pick one chunk size that is "good enough" for many cases because they lack efficient techniques to determine the benefits at other chunk sizes. This paper addresses this deficiency via two contributions: (1) we present multiresolution (MR) handprinting, an application-independent technique that efficiently estimates similarity between data items at different chunk sizes using a compact, multi-size representation of the data; (2) we evaluate the application of MR handprints to workloads from peer-to-peer, file transfer, and storage systems, demonstrating that the chunk size selection enabled by MR handprints can lead to real improvements over using a fixed chunk size in these systems.
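
The chunk-and-compare scheme the abstract describes, and the way measured similarity shifts with chunk size, can be illustrated with a minimal sketch. This is not MR handprinting and not the authors' implementation: it computes exact chunk overlap by brute force, which is the costly baseline that an estimation technique would aim to avoid. Fixed-size chunking, SHA-1 identifiers, and the function names `chunk_ids`, `similarity`, and `similarity_by_chunk_size` are all assumptions made for illustration.

```python
import hashlib


def chunk_ids(data: bytes, chunk_size: int) -> set[bytes]:
    """Split data into fixed-size chunks and return the set of chunk identifiers.

    Assumption: fixed-size chunks identified by SHA-1 digests.
    """
    return {
        hashlib.sha1(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    }


def similarity(a: bytes, b: bytes, chunk_size: int) -> float:
    """Fraction of a's distinct chunks that also appear in b."""
    ids_a = chunk_ids(a, chunk_size)
    ids_b = chunk_ids(b, chunk_size)
    if not ids_a:
        return 0.0
    return len(ids_a & ids_b) / len(ids_a)


def similarity_by_chunk_size(a: bytes, b: bytes, sizes: list[int]) -> dict[int, float]:
    """Compute exact chunk-level similarity at each candidate chunk size."""
    return {size: similarity(a, b, size) for size in sizes}


# Deterministic 4096-byte test input built from SHA-256 digests (32 bytes each).
original = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(128))

# Same content with a single byte flipped at offset 100 (length unchanged).
edited = original[:100] + bytes([original[100] ^ 0xFF]) + original[101:]

estimates = similarity_by_chunk_size(original, edited, [4096, 1024, 256])
for size, value in estimates.items():
    print(f"chunk size {size:5d}: similarity {value:.4f}")
```

With a single flipped byte, the one chunk containing it no longer matches, so the whole-file chunk (4096 bytes) reports no similarity while smaller chunks report progressively more. The price is visible as well: 16 identifiers are needed at 256 bytes versus 1 at 4096 bytes.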