Data deduplication in a hybrid architecture for improving write performance

Authors:
Chao Chen;Jonathan Bastnagel;Yong Chen
Affiliations:
Texas Tech University Lubbock, TX;Texas Tech University Lubbock, TX;Texas Tech University Lubbock, TX
Venue:
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
Year:
2013


Abstract

Big Data computing provides a promising new opportunity for scientific discovery and innovation. However, it also poses a significant challenge to the high-end computing community. An effective I/O solution is urgently required to support big data applications running on high-end computing systems. In this study, we propose a new approach, DDiHA (Data Deduplication in Hybrid Architecture), to improve the write performance of write-intensive big data applications. The DDiHA approach uses data deduplication to reduce the size of data volumes before they are transferred and written to storage. A hybrid architecture is introduced to facilitate data deduplication. Both a theoretical study and prototype verification were conducted to evaluate the DDiHA approach. The initial results show that, given the same compute resources, the DDiHA system outperformed the conventional architecture, even though it introduces additional computation workload from data deduplication. The DDiHA approach reduces the size of the data transferred across the network and improves I/O system performance. It shows promising potential for write-intensive big data applications.
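
The abstract does not say how deduplication is performed, so the following is only a minimal Python sketch of the general idea of shrinking data before it crosses the network. It assumes fixed-size chunking and SHA-256 fingerprints, neither of which is stated in the abstract. The names `CHUNK_SIZE`, `deduplicate`, `bytes_to_transfer`, and `restore` are hypothetical and are not taken from DDiHA.

```python
import hashlib

# Hypothetical chunk size; the abstract does not specify one.
CHUNK_SIZE = 4096


def deduplicate(data, seen, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and keep only unseen ones.

    `seen` is a set of fingerprints already held by the storage side.
    Returns (recipe, new_chunks): the ordered list of fingerprints
    needed to rebuild the data, and a dict of the chunks that still
    have to be sent.
    """
    recipe = []
    new_chunks = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        recipe.append(digest)
        if digest not in seen:
            seen.add(digest)
            new_chunks[digest] = chunk
    return recipe, new_chunks


def bytes_to_transfer(recipe, new_chunks):
    """Payload size after deduplication: new chunks plus fingerprints."""
    chunk_bytes = sum(len(c) for c in new_chunks.values())
    recipe_bytes = sum(len(d) for d in recipe)
    return chunk_bytes + recipe_bytes


def restore(recipe, store):
    """Rebuild the original data from a recipe and a chunk store."""
    return b"".join(store[d] for d in recipe)


if __name__ == "__main__":
    data = b"A" * (3 * CHUNK_SIZE) + b"B" * CHUNK_SIZE
    seen = set()
    recipe, new_chunks = deduplicate(data, seen)
    print(len(data), bytes_to_transfer(recipe, new_chunks))
```

In this sketch the hashing cost stands in for the additional computation workload that the abstract mentions, and the smaller payload for the reduced data size on the network. Whether the trade pays off depends on how much redundancy the workload contains.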