Improving duplicate elimination in storage systems

Authors:
Deepak R. Bobbarjung;Suresh Jagannathan;Cezary Dubnicki
Affiliations:
Purdue University, West Lafayette, IN;Purdue University, West Lafayette, IN;NEC Laboratories America, Princeton, NJ
Venue:
ACM Transactions on Storage (TOS)
Year:
2006

Abstract

Minimizing the amount of data that must be stored and managed is a key goal for any storage architecture that purports to be scalable. One way to achieve this goal is to avoid maintaining duplicate copies of the same data. Eliminating redundant data at the source by not writing data which has already been stored not only reduces storage overheads, but can also improve bandwidth utilization. For these reasons, in the face of today's exponentially growing data volumes, redundant data elimination techniques have assumed critical significance in the design of modern storage systems.

Intelligent object partitioning techniques identify data that is new when objects are updated, and transfer only these chunks to a storage server. In this article, we propose a new object partitioning technique, called fingerdiff, that improves upon existing schemes in several important respects. Most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects in order to improve storage and bandwidth utilization. We present a detailed evaluation of fingerdiff, and other existing object partitioning schemes, using a set of real-world workloads. We show that for these workloads, the duplicate elimination strategies employed by fingerdiff improve storage utilization on average by 25%, and bandwidth utilization on average by 40% over comparable techniques.
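
The general idea of object partitioning described in the abstract can be illustrated with a minimal sketch of generic content-defined chunking. This is not fingerdiff: the abstract does not specify the algorithm, and its dynamic choice of partitioning strategy is not shown here. The names `chunk` and `ChunkStore`, the window size, and the boundary mask are all hypothetical choices made for illustration. For simplicity the sketch rehashes each window with SHA-1 instead of using a true rolling hash.

```python
import hashlib
import random

WINDOW = 16          # assumed: bytes of context that decide a boundary (also the minimum chunk size)
MASK = (1 << 6) - 1  # assumed: cut when the low 6 bits of the window hash are zero

def chunk(data: bytes):
    """Split data at boundaries chosen from content, not from fixed offsets."""
    chunks, start = [], 0
    for i in range(len(data)):
        if i + 1 - start < WINDOW:
            continue  # keep every chunk at least WINDOW bytes long
        window = data[i + 1 - WINDOW:i + 1]
        h = int.from_bytes(hashlib.sha1(window).digest()[:4], "big")
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks

class ChunkStore:
    """Hypothetical storage server that keeps one copy of each distinct chunk."""

    def __init__(self):
        self.chunks = {}

    def put(self, data: bytes):
        """Store an object; return its chunk recipe and the bytes actually sent."""
        recipe, sent = [], 0
        for c in chunk(data):
            key = hashlib.sha256(c).hexdigest()
            if key not in self.chunks:  # only new chunks are transferred
                self.chunks[key] = c
                sent += len(c)
            recipe.append(key)
        return recipe, sent

    def get(self, recipe):
        """Rebuild an object from its recipe."""
        return b"".join(self.chunks[k] for k in recipe)

# Usage: store an object, then an updated version with a small insertion.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(4096))
v2 = v1[:2000] + b"INSERTED" + v1[2000:]

store = ChunkStore()
recipe1, sent1 = store.put(v1)
recipe2, sent2 = store.put(v2)
print(f"first version sent {sent1} bytes, updated version sent {sent2} bytes")
```

Because boundaries depend on content, an insertion shifts the data without shifting most chunk boundaries relative to it, so the chunks after the edit usually match ones already stored. Fixed-size blocks would lose this property, since every block after the insertion point would change.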