Deduplication is a storage-saving technique that is highly successful in enterprise backup environments. On a file system, a single data block might be stored multiple times across different files; for example, multiple mostly identical versions of a file might exist. With deduplication, this replicated data is identified and the redundancy is removed: the data is stored just once, and all files that contain identical regions refer to the same unique copy. The most common approach splits file data into chunks and calculates a cryptographic fingerprint for each chunk. By checking whether the fingerprint has already been stored, a chunk is classified as redundant or unique, and only unique chunks are stored. This paper presents the first study on the potential of data deduplication in HPC centers, which are among the most demanding producers of storage data. We have quantitatively assessed this potential for capacity reduction for four data centers (BSC, DKRZ, RENCI, RWTH). In contrast to previous deduplication studies, which focus mostly on backup data, we have analyzed over one PB (1212 TB) of online file system data. The evaluation shows that typically 20% to 30% of this online data can be removed by applying data deduplication techniques, peaking at up to 70% for some data sets. This reduction can only be achieved by a subfile deduplication approach, while approaches based on whole-file comparisons lead to only small capacity savings.
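The chunk-and-fingerprint scheme described above can be illustrated with a minimal sketch. The snippet below assumes fixed-size chunking with an 8 KiB chunk size and SHA-1 fingerprints; these are illustrative choices, not the parameters or implementation used in the study, which may instead rely on content-defined chunking and other hash functions.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # assumed chunk size (8 KiB), for illustration only


def estimate_dedup(paths):
    """Scan files chunk by chunk and count total vs. unique bytes."""
    seen = set()          # fingerprints of chunks already "stored"
    total_bytes = 0
    unique_bytes = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                total_bytes += len(chunk)
                fingerprint = hashlib.sha1(chunk).digest()
                if fingerprint not in seen:
                    # unique chunk: would be written to storage
                    seen.add(fingerprint)
                    unique_bytes += len(chunk)
                # redundant chunk: only a reference would be kept
    return total_bytes, unique_bytes


# Example: capacity savings = 1 - unique_bytes / total_bytes
# total_bytes, unique_bytes = estimate_dedup(["file_v1.dat", "file_v2.dat"])
```

In this sketch the achievable capacity reduction is simply one minus the ratio of unique to total bytes; a whole-file comparison would instead fingerprint entire files, which is why it misses the subfile redundancy the abstract refers to.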