File routing middleware for cloud deduplication

Authors:
Petros Efstathopoulos
Affiliations:
Symantec Research Labs, Symantec Corporation, Culver City, CA
Venue:
Proceedings of the 2nd International Workshop on Cloud Computing Platforms
Year:
2012

Abstract

Deduplication technology is maturing and becoming a standard feature of most storage architectures. Many approaches have been proposed to address the deduplication scalability challenges of privately owned storage infrastructure, but as storage moves to the cloud, deduplication mechanisms are expected to scale to thousands of storage nodes. Currently available solutions were not designed for such scale, and research and practical experience suggest that aiming for global deduplication among thousands of nodes will almost certainly lead to high complexity, reduced performance, and reduced reliability. Instead, we propose performing local deduplication within each cloud node and introduce file similarity metrics to determine which node is the best deduplication host for a particular incoming file. This approach reduces the problem of scalable cloud deduplication to a file routing problem, which we address with a software layer capable of making the necessary routing decisions. Using the proposed file routing middleware layer, the system can achieve three important properties: it can scale to thousands of nodes, support almost any type of underlying storage node, and make the most of file-level deduplication.
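
The routing idea can be illustrated with a minimal sketch. The abstract does not specify the similarity metrics, so everything here is an assumption made for illustration: the signature (the k smallest SHA-1 fingerprints of fixed-size chunks), the overlap-count score, the load-based tie-break, and the names `file_signature`, `Node`, `route`, and `store` are all hypothetical and are not taken from the paper.

```python
import hashlib
from dataclasses import dataclass, field


def file_signature(data: bytes, chunk_size: int = 4096, k: int = 8) -> frozenset[str]:
    """Hypothetical similarity signature: the k smallest chunk fingerprints."""
    fingerprints = {
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }
    return frozenset(sorted(fingerprints)[:k])


@dataclass
class Node:
    """Stand-in for a storage node that deduplicates locally."""
    name: str
    seen: set[str] = field(default_factory=set)  # signature entries routed here so far
    stored_files: int = 0


def route(signature: frozenset[str], nodes: list[Node]) -> Node:
    """Pick the node sharing the most signature entries; prefer lighter nodes on ties."""
    return max(nodes, key=lambda n: (len(signature & n.seen), -n.stored_files))


def store(data: bytes, nodes: list[Node]) -> Node:
    """Route a file, then record its signature on the chosen node."""
    signature = file_signature(data)
    target = route(signature, nodes)
    target.seen |= signature
    target.stored_files += 1
    return target
```

In this sketch the routing layer keeps only a small signature per file rather than a global chunk index, so each node remains free to deduplicate with whatever mechanism it already has. A file that shares chunks with an earlier file is sent to the node that received the earlier one, while unrelated files spread across the less loaded nodes.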