A novel approach to data deduplication over the engineering-oriented cloud systems

Authors:
Zhe Sun;Jun Shen;Jianming Yong
Affiliations:
School of Information Systems and Technology, University of Wollongong, Wollongong, NSW, Australia and Information Management Center, Huaneng Shandong Shidao Bay Nuclear Power Company, Ltd, Longch ...;School of Information Systems and Technology, University of Wollongong, Wollongong, NSW, Australia;School of Information Systems, University of Southern Queensland, Toowoomba, QLD, Australia
Venue:
Integrated Computer-Aided Engineering
Year:
2013

Abstract

This paper presents a deduplicated storage system for engineering-oriented cloud computing platforms. The system, which manages both data and duplicate detection across the cloud, consists of two major components: a front-end deduplication application and a back-end mass storage system. The Hadoop Distributed File System (HDFS) is a widely used distributed file system in the cloud and is commonly paired with the Hadoop database HBase. We use HDFS to build the mass storage system and HBase to build a fast indexing system. Combined with the deduplication application, these form a scalable, parallel, deduplicated cloud storage system. We further use VMware to generate a simulated cloud environment. The simulation results demonstrate that our deduplication storage system is sufficiently accurate and efficient for distributed and cooperative data-intensive engineering applications.
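The two-component design described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the fingerprint function or the deduplication granularity, so whole-file SHA-256 fingerprints are assumed here, and plain in-memory dictionaries stand in for the HBase index and the HDFS block store. The class name `DedupStore` and its methods are hypothetical.

```python
import hashlib


class DedupStore:
    """Toy model of a front-end deduplicator over an index and a bulk store.

    Assumptions (not from the paper): whole-file SHA-256 fingerprints,
    with dicts standing in for the HBase index and the HDFS storage.
    """

    def __init__(self):
        self.index = {}  # fingerprint -> reference count (HBase's role)
        self.blobs = {}  # fingerprint -> content bytes (HDFS's role)
        self.names = {}  # logical file name -> fingerprint

    def put(self, name, data):
        """Store data under name; return True only if new bytes were written."""
        fingerprint = hashlib.sha256(data).hexdigest()
        previous = self.names.get(name)
        if previous == fingerprint:
            return False  # same name and same content: nothing to do
        if previous is not None:
            self._release(previous)  # the name is being overwritten
        is_new = fingerprint not in self.index
        if is_new:
            self.blobs[fingerprint] = data  # only unique content reaches storage
            self.index[fingerprint] = 0
        self.index[fingerprint] += 1
        self.names[name] = fingerprint
        return is_new

    def get(self, name):
        """Return the content stored under a logical file name."""
        return self.blobs[self.names[name]]

    def delete(self, name):
        """Remove a logical file; free its content when no name refers to it."""
        self._release(self.names.pop(name))

    def _release(self, fingerprint):
        self.index[fingerprint] -= 1
        if self.index[fingerprint] == 0:
            del self.index[fingerprint]
            del self.blobs[fingerprint]


store = DedupStore()
first = store.put("design_v1.dwg", b"turbine housing model")
second = store.put("design_copy.dwg", b"turbine housing model")
```

In this sketch the second `put` writes nothing to the bulk store because the index lookup finds an existing fingerprint; only the reference count and the name mapping change. A real deployment would replace the dictionaries with HBase and HDFS clients and would need to handle concurrent writers, which this example ignores.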