Alternatives for detecting redundancy in storage systems data

Authors:
Calicrates Policroniades;Ian Pratt
Affiliations:
Computer Laboratory, Cambridge University, Cambridge, UK;Computer Laboratory, Cambridge University, Cambridge, UK
Venue:
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Year:
2004

Abstract

Storage systems frequently maintain identical copies of data. Identifying such data can assist in the design of solutions in which data storage, transmission, and management are optimised. In this paper we evaluate three methods used to discover identical portions of data: whole file content hashing, fixed size blocking, and a chunking strategy that uses Rabin fingerprints to delimit content-defined data chunks. We assess how effective each of these strategies is in finding identical sections of data. In our experiments, we analysed diverse data sets from a variety of different types of storage systems including a mirrored section of sunsite.org.uk, different data profiles in the file system infrastructure of the Cambridge University Computer Laboratory, source code distribution trees, compressed data, and packed files. We report our experimental results and present a comparative analysis of these techniques. This study also shows how levels of similarity differ between data sets and file types. Finally, we discuss the advantages and disadvantages in the application of these methods in the light of our experimental results.
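
The three detection methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the use of SHA-1 for chunk digests, all function names, and all parameter values (block size, window, mask, minimum and maximum chunk sizes) are assumptions chosen for illustration, and a simple polynomial rolling hash stands in for true Rabin fingerprints. The sketch measures the fraction of bytes that fall in chunks already seen.

```python
import hashlib
import random


def digest(chunk: bytes) -> str:
    # SHA-1 is an assumption; any collision-resistant hash would serve.
    return hashlib.sha1(chunk).hexdigest()


def whole_file(data: bytes) -> list:
    """Whole file content hashing: one chunk per file."""
    return [data] if data else []


def fixed_blocks(data: bytes, block_size: int = 64) -> list:
    """Fixed size blocking: cut at fixed offsets, regardless of content."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def content_defined(data: bytes, window: int = 16, mask: int = 0x3F,
                    min_size: int = 16, max_size: int = 256) -> list:
    """Content-defined chunking with a polynomial rolling hash.

    A boundary is declared where the hash of the last `window` bytes has
    its low bits (selected by `mask`) equal to zero, subject to the
    hypothetical min_size/max_size limits. This stands in for Rabin
    fingerprinting, which it only approximates.
    """
    base, mod = 257, (1 << 61) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod  # drop the oldest byte
        h = (h * base + byte) % mod                 # add the newest byte
        size = i - start + 1
        at_anchor = i + 1 >= window and (h & mask) == 0
        if (size >= min_size and at_anchor) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def redundancy(files, chunker) -> float:
    """Fraction of bytes lying in chunks whose digest was already seen."""
    seen, total, duplicate = set(), 0, 0
    for data in files:
        for chunk in chunker(data):
            total += len(chunk)
            key = digest(chunk)
            if key in seen:
                duplicate += len(chunk)
            seen.add(key)
    return duplicate / total if total else 0.0


def demo_files(size: int = 4096, seed: int = 0) -> list:
    """Synthetic data: a pseudo-random file and a copy with one byte prepended."""
    rng = random.Random(seed)
    original = bytes(rng.getrandbits(8) for _ in range(size))
    return [original, b"X" + original]


if __name__ == "__main__":
    files = demo_files()
    for name, chunker in [("whole file", whole_file),
                          ("fixed blocks", fixed_blocks),
                          ("content-defined", content_defined)]:
        print(f"{name:16s} {redundancy(files, chunker):.2%}")
```

In this toy setting, prepending a single byte shifts every fixed-size block and changes the whole-file digest, so those two methods should find almost nothing in common, whereas content-defined boundaries depend only on nearby bytes and are expected to realign shortly after the insertion. The synthetic data is only a stand-in and says nothing about the data sets evaluated in the paper.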