ChunkStash: speeding up inline storage deduplication using flash memory

Authors:
Biplob Debnath;Sudipta Sengupta;Jin Li
Affiliations:
University of Minnesota, Twin Cities;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA
Venue:
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Year:
2010

Citing 28
Cited 27

The art of computer programming, volume 3: (2nd ed.) sorting and searching

The art of computer programming, volume 3: (2nd ed.) sorting and searching
A low-bandwidth network file system

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Venti: A New Approach to Archival Storage

FAST '02 Proceedings of the Conference on File and Storage Technologies
Cuckoo hashing

Journal of Algorithms
Algorithms and data structures for flash memories

ACM Computing Surveys (CSUR)
Architecture-conscious hashing

DaMoN '06 Proceedings of the 2nd international workshop on Data management on new hardware
The Berkeley DB Book

The Berkeley DB Book
FlashDB: dynamic self-tuning database for NAND flash

Proceedings of the 6th international conference on Information processing in sensor networks
Microhash: an efficient index structure for flash-based sensor devices

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
A flash-memory based file system

TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
A log buffer-based flash translation layer using fully-associative sector translation

ACM Transactions on Embedded Computing Systems (TECS)
Flash storage memory

Communications of the ACM - Web science
BPLRU: a buffer management scheme for improving random writes in flash storage

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Avoiding the disk bottleneck in the data domain deduplication file system

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Design tradeoffs for SSD performance

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Flashing up the storage layer

Proceedings of the VLDB Endowment
Online maintenance of very large random samples on flash storage

Proceedings of the VLDB Endowment
Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
DFTL: a flash translation layer employing demand-based selective caching of page-level address mappings

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Sparse indexing: large scale, inline deduplication using sampling and locality

FAST '09 Proceedings of the 7th conference on File and storage technologies
HYDRAstor: a Scalable Secondary Storage

FAST '09 Proceedings of the 7th conference on File and storage technologies
FlashLogging: exploiting flash devices for synchronous logging performance

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
FAWN: a fast array of wimpy nodes

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
HydraFS: a high-throughput file system for the HYDRAstor content-addressable storage system

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Bimodal content defined chunking for backup streams

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Cheap and large CAMs for high performance data-intensive networked systems

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Decentralized deduplication in SAN cluster file systems

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
More Robust Hashing: Cuckoo Hashing with a Stash

SIAM Journal on Computing

FlashStore: high throughput persistent key-value store

Proceedings of the VLDB Endowment
CAFTL: a content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives

FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
Leveraging value locality in optimizing NAND flash-based SSDs

FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
SSDAlloc: hybrid SSD/RAM memory management made easy

Proceedings of the 8th USENIX conference on Networked systems design and implementation
SkimpyStash: RAM space skimpy key-value store on flash-based storage

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
SCMFS: a file system for storage class memory

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Live deduplication storage of virtual machine images in an open-source cloud

Middleware'11 Proceedings of the 12th ACM/IFIP/USENIX international conference on Middleware
WAN optimized replication of backup datasets using stream-informed delta compression

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Shredder: GPU-accelerated incremental storage and computation

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
iDedup: latency-aware, inline data deduplication for primary storage

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Caching less for better performance: balancing cache size and update cost of flash memory cache in hybrid storage systems

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
A study of space reclamation on flash-based append-only storage management

DASFAA'12 Proceedings of the 17th international conference on Database Systems for Advanced Applications
Primary data deduplication-large scale study and system design

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
BVSSD: build built-in versioning flash-based solid state drives

Proceedings of the 5th Annual International Systems and Storage Conference
Reducing impact of data fragmentation caused by in-line deduplication

Proceedings of the 5th Annual International Systems and Storage Conference
WAN-optimized replication of backup datasets using stream-informed delta compression

ACM Transactions on Storage (TOS)
Droplet: A Distributed Solution of Data Deduplication

GRID '12 Proceedings of the 2012 ACM/IEEE 13th International Conference on Grid Computing
Live deduplication storage of virtual machine images in an open-source cloud

Proceedings of the 12th International Middleware Conference
Block locality caching for data deduplication

Proceedings of the 6th International Systems and Storage Conference
SCMFS: A File System for Storage Class Memory and its Extensions

ACM Transactions on Storage (TOS)
SAFE: A Source Deduplication Framework for Efficient Cloud Backup Services

Journal of Signal Processing Systems
Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud

ACM Transactions on Storage (TOS)
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Tango: distributed data structures over a shared log

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Triple-A: a Non-SSD based autonomic all-flash array for high performance storage systems

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Abstract

Storage deduplication has received recent interest in the research community. In scenarios where the backup process has to complete within short time windows, inline deduplication can help to achieve higher backup throughput. In such systems, the method of identifying duplicate data, using disk-based indexes on chunk hashes, can create throughput bottlenecks due to the disk I/Os involved in index lookups. RAM prefetching and Bloom filter-based techniques used by Zhu et al. [42] can avoid disk I/Os on close to 99% of the index lookups. Even at this reduced rate, an index lookup going to disk contributes about 0.1 msec to the average lookup time, which is about 1000 times slower than a lookup hitting in RAM. We propose to reduce the penalty of index lookup misses in RAM by orders of magnitude by serving such lookups from a flash-based index, thereby increasing inline deduplication throughput. Flash memory can reduce the huge gap between RAM and hard disk in terms of both cost and access times and is a suitable choice for this application. To this end, we design a flash-assisted inline deduplication system using ChunkStash, a chunk metadata store on flash. ChunkStash uses one flash read per chunk lookup and works in concert with RAM prefetching strategies. It organizes chunk metadata in a log structure on flash to exploit fast sequential writes. It uses an in-memory hash table to index the metadata, with hash collisions resolved by a variant of cuckoo hashing. The in-memory hash table stores compact key signatures (2 bytes) instead of full chunk-ids (20-byte SHA-1 hashes) so as to strike tradeoffs between RAM usage and false flash reads. Further, by indexing a small fraction of chunks per container, ChunkStash can reduce RAM usage significantly with negligible loss in deduplication quality. Evaluations using real-world enterprise backup datasets show that ChunkStash outperforms an inline deduplication system based on a hard disk index by 7x-60x on the metric of backup throughput (MB/sec).
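
The index organization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name ChunkIndexSketch, the slot count, the number of hash functions, the eviction policy, and the way positions and signatures are derived from SHA-1 are all assumptions made for illustration, and a Python list stands in for the append-only log on flash. It shows an in-memory cuckoo-style table whose slots hold only a 2-byte signature and a log pointer, so a lookup touches the log only when a signature matches.

```python
import hashlib


class ChunkIndexSketch:
    """Illustrative only: RAM table of (2-byte signature, log offset) slots."""

    def __init__(self, num_slots=1024, num_hashes=4, max_kicks=64):
        self.slots = [None] * num_slots  # RAM: (signature, log_offset) or None
        self.num_hashes = num_hashes     # candidate positions per key (assumed)
        self.max_kicks = max_kicks       # bound on cuckoo relocations (assumed)
        self.log = []                    # stand-in for the append-only flash log
        self.flash_reads = 0             # counts simulated flash reads

    def _position(self, chunk_id, i):
        # i-th candidate slot for this key (hypothetical derivation).
        digest = hashlib.sha1(bytes([i]) + chunk_id).digest()
        return int.from_bytes(digest[:8], "big") % len(self.slots)

    def _signature(self, chunk_id):
        # 2-byte compact signature kept in RAM instead of the 20-byte chunk-id.
        return hashlib.sha1(b"sig" + chunk_id).digest()[:2]

    def _read_log(self, offset):
        self.flash_reads += 1
        return self.log[offset]

    def lookup(self, chunk_id):
        sig = self._signature(chunk_id)
        for i in range(self.num_hashes):
            entry = self.slots[self._position(chunk_id, i)]
            if entry is not None and entry[0] == sig:
                stored_id, metadata = self._read_log(entry[1])
                if stored_id == chunk_id:
                    return metadata
                # Otherwise a false flash read: signature matched, key did not.
        return None

    def insert(self, chunk_id, metadata):
        # Caller is expected to call lookup() first; duplicates are not checked.
        self.log.append((chunk_id, metadata))  # sequential append to the log
        entry = (self._signature(chunk_id), len(self.log) - 1)
        key = chunk_id
        for kick in range(self.max_kicks):
            positions = [self._position(key, i) for i in range(self.num_hashes)]
            for pos in positions:
                if self.slots[pos] is None:
                    self.slots[pos] = entry
                    return True
            # All candidates occupied: displace one occupant and re-place it.
            victim = positions[kick % self.num_hashes]
            entry, self.slots[victim] = self.slots[victim], entry
            # RAM holds no full key, so the displaced key is read from the log.
            key = self._read_log(entry[1])[0]
        return False  # gave up; a real system would need a fallback here


index = ChunkIndexSketch()
chunk_id = hashlib.sha1(b"example chunk data").digest()
if index.lookup(chunk_id) is None:
    index.insert(chunk_id, {"container": 7})
print(index.lookup(chunk_id), index.flash_reads)
```

In this sketch a shorter signature saves RAM per slot but makes signature collisions, and therefore false flash reads, more likely, which is the tradeoff the abstract refers to. Reading displaced keys back from the log during relocation is a simplification chosen here, not a claim about how ChunkStash handles relocation.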