I/O deduplication: utilizing content similarity to improve I/O performance

Authors:
Ricardo Koller;Raju Rangaswami
Affiliations:
School of Computing and Information Sciences, Florida International University;School of Computing and Information Sciences, Florida International University
Venue:
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Year:
2010

Citing 36
Cited 15

A system for adaptive disk rearrangement

Software—Practice & Experience
Doubly distorted mirrors

SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Adaptive block rearrangement

ACM Transactions on Computer Systems (TOCS)
Copy detection mechanisms for digital documents

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Minimizing Expected Head Movement in One-Dimensional and Two-Dimensional Mass Storage Systems

ACM Computing Surveys (CSUR)
Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
Distorted mirrors

PDIS '91 Proceedings of the first international conference on Parallel and distributed information systems
A low-bandwidth network file system

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Configuring and Scheduling an Eager-Writing Disk Array for a Transaction Processing Workload

FAST '02 Proceedings of the Conference on File and Storage Technologies
Venti: A New Approach to Archival Storage

FAST '02 Proceedings of the Conference on File and Storage Technologies
Disk Shadowing

VLDB '88 Proceedings of the 14th International Conference on Very Large Data Bases
My Cache or Yours? Making Storage More Exclusive

ATEC '02 Proceedings of the General Track of the annual conference on USENIX Annual Technical Conference
Peabody: The Time Travelling Disk

MSS '03 Proceedings of the 20 th IEEE/11 th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03)
Rules of Thumb in Data Engineering

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Memory resource management in VMware ESX server

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
ARC: A Self-Tuning, Low Overhead Replacement Cache

FAST '03 Proceedings of the 2nd USENIX Conference on File and Storage Technologies
Passive NFS Tracing of Email and Research Workloads

FAST '03 Proceedings of the 2nd USENIX Conference on File and Storage Technologies
FS2: dynamic data replication in free disk space for improving disk performance and energy consumption

Proceedings of the twentieth ACM symposium on Operating systems principles
The automatic improvement of locality in storage systems

ACM Transactions on Computer Systems (TOCS)
CLOCK-Pro: an effective improvement of the CLOCK replacement

ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Redundancy elimination within large collections of files

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Second-tier cache management using write hints

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
TAPER: tiered approach for eliminating redundancy in replica synchronization

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Trading capacity for performance in a disk array

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
On multi-level exclusive caching: offline optimality and why promotions are better than demotions

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Avoiding the disk bottleneck in the data domain deduplication file system

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Fast, inexpensive content-addressed storage in foundation

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Measurement and analysis of large-scale network file system workloads

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
The case for active block layer extensions

ACM SIGOPS Operating Systems Review
Sparse indexing: large scale, inline deduplication using sampling and locality

FAST '09 Proccedings of the 7th conference on File and storage technologies
BORG: block-reORGanization for self-optimizing storage systems

FAST '09 Proccedings of the 7th conference on File and storage technologies
IBM System Storage San Volume Controller

IBM System Storage San Volume Controller
Evaluation techniques for storage hierarchies

IBM Systems Journal
Difference engine: harnessing memory redundancy in virtual machines

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Satori: enlightened page sharing

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Decentralized deduplication in SAN cluster file systems

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference

IOrchestrator: Improving the Performance of Multi-node I/O Systems via Inter-Server Coordination

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Leveraging value locality in optimizing NAND flash-based SSDs

FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
A scheduling framework that makes any disk schedulers non-work-conserving solely based on request characteristics

FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
Cost effective storage using extent based dynamic tiering

FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
Virtually cool ternary content addressable memory

HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
DeFFS: Duplication-eliminated flash file system

Computers and Electrical Engineering
FlashTier: a lightweight, consistent and durable storage cache

Proceedings of the 7th ACM european conference on Computer Systems
iDedup: latency-aware, inline data deduplication for primary storage

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Caching less for better performance: balancing cache size and update cost of flash memory cache in hybrid storage systems

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Software persistent memory

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Tradeoffs in compressing virtual machine checkpoints

Proceedings of the 7th international workshop on Virtualization technologies in distributed computing
Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud

ACM Transactions on Storage (TOS)
CareDedup: cache-aware deduplication for reading performance optimization in primary storage

Proceedings Demo & Poster Track of ACM/IFIP/USENIX International Middleware Conference
Write policies for host-side flash caches

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
(Big)data in a virtualized world: volume, velocity, and variety in cloud datacenters

FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies

Quantified Score

Hi-index	0.00

Visualization

Abstract

Duplication of data in storage systems is becoming increasingly common. We introduce I/O Deduplication, a storage optimization that utilizes content similarity for improving I/O performance by eliminating I/O operations and reducing the mechanical delays during I/O operations. I/O Deduplication consists of three main techniques: content-based caching, dynamic replica retrieval, and selective duplication. Each of these techniques is motivated by our observations with I/O workload traces obtained from actively-used production storage systems, all of which revealed surprisingly high levels of content similarity for both stored and accessed data. Evaluation of a prototype implementation using these workloads revealed an overall improvement in disk I/O performance of 28-47% across these workloads. Further breakdown also showed that each of the three techniques contributed significantly to the overall performance improvement.