SAFE: A Source Deduplication Framework for Efficient Cloud Backup Services

Authors:
Yujuan Tan;Hong Jiang;Edwin Hsing-Mean Sha;Zhichao Yan;Dan Feng
Affiliations:
College of Computer Science, Chongqing University, Chongqing, China;Department of Computer Science & Engineering, University of Nebraska-Lincoln, Lincoln, USA;College of Computer Science, Chongqing University, Chongqing, China;Department of Computer Science & Engineering, University of Nebraska-Lincoln, Lincoln, USA;School of Computer Science & Technology, Huazhong University of Science & Technology, Wuhan, China
Venue:
Journal of Signal Processing Systems
Year:
2013

Abstract

Because cloud backup services run over relatively low-bandwidth WAN links and the amount of backed-up data stored at service providers keeps increasing, the deduplication scheme used in the cloud backup environment must remove redundant data both from backup operations, to reduce backup times and storage costs, and from restore operations, to reduce restore times. In this paper, we propose SAFE, a source deduplication framework for efficient cloud backup and restore operations. SAFE has three salient features: (1) Hybrid Deduplication, which combines global file-level and local chunk-level deduplication to achieve an optimal tradeoff between deduplication efficiency and overhead, and thus a short backup time; (2) Semantic-aware Elimination, which exploits file semantics to narrow the search space for redundant data in the hybrid deduplication process, reducing the deduplication overhead; and (3) Unmodified Data Removal, which excludes files and data chunks that have been kept intact from data transmission in some restore operations. Through extensive experiments driven by real-world datasets, the SAFE framework is shown to maintain a much higher deduplication efficiency/overhead ratio than existing solutions, shortening the backup time by an average of 38.7% and reducing the restore time by a ratio of up to 9.7:1.
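
To make the first two features concrete, the following is a minimal illustrative sketch, not the SAFE implementation. It checks a whole-file fingerprint against a global index first, and only for unseen files falls back to a chunk-level lookup in a local index. Everything specific here is an assumption that the abstract does not state: fixed-size chunking, SHA-1 fingerprints, in-memory sets for both indexes, the use of the file extension as the "file semantics" that narrows the chunk search, and the hypothetical names `HybridDeduplicator` and `backup`.

```python
import hashlib
from typing import Dict, List, Set

CHUNK_SIZE = 4  # unrealistically small, chosen only to keep the example readable


def fingerprint(data: bytes) -> str:
    """Content fingerprint (SHA-1 is an assumption, not taken from the paper)."""
    return hashlib.sha1(data).hexdigest()


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> List[bytes]:
    """Fixed-size chunking (an assumption; other chunking schemes would also fit)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class HybridDeduplicator:
    """Hypothetical sketch: global file-level check, then local chunk-level check."""

    def __init__(self) -> None:
        # Stands in for the file-level index shared across all clients.
        self.global_file_index: Set[str] = set()
        # Stands in for the client-local chunk index, partitioned by file
        # extension so each lookup searches a smaller space.
        self.local_chunk_index: Dict[str, Set[str]] = {}

    def backup(self, name: str, data: bytes) -> List[bytes]:
        """Return the chunks that would still have to be sent over the WAN."""
        file_fp = fingerprint(data)
        if file_fp in self.global_file_index:
            return []  # whole file is already stored; nothing to transmit
        self.global_file_index.add(file_fp)

        extension = name.rsplit(".", 1)[-1] if "." in name else ""
        known = self.local_chunk_index.setdefault(extension, set())

        to_send: List[bytes] = []
        for chunk in split_chunks(data):
            chunk_fp = fingerprint(chunk)
            if chunk_fp not in known:
                known.add(chunk_fp)
                to_send.append(chunk)
        return to_send


dedup = HybridDeduplicator()
first = dedup.backup("a.txt", b"AAAABBBBCCCC")     # new file, all chunks sent
second = dedup.backup("copy.txt", b"AAAABBBBCCCC")  # identical file, nothing sent
third = dedup.backup("b.txt", b"AAAADDDDCCCC")     # only the changed chunk sent
fourth = dedup.backup("c.doc", b"AAAAEEEE")        # different extension, separate index
```

In this sketch, partitioning the chunk index by extension trades some missed duplicates (the `AAAA` chunk in `c.doc` is sent again) for a smaller lookup space, which is one plausible reading of the efficiency/overhead tradeoff the abstract describes.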