Redundancy elimination within large collections of files

  • Authors:
  • Purushottam Kulkarni (University of Massachusetts, Amherst, MA)
  • Fred Douglis (IBM T. J. Watson Research Center, Hawthorne, NY)
  • Jason LaVoie (IBM T. J. Watson Research Center, Hawthorne, NY)
  • John M. Tracey (IBM T. J. Watson Research Center, Hawthorne, NY)

  • Venue:
  • ATEC '04: Proceedings of the 2004 USENIX Annual Technical Conference
  • Year:
  • 2004

Abstract

Ongoing advances in technology lead to ever-increasing storage capacities. In spite of this, optimizing storage usage can still provide rich dividends. Several techniques based on delta-encoding and duplicate block suppression have been shown to reduce storage overheads, with varying requirements for resources such as computation and memory. We propose a new scheme for storage reduction that reduces data sizes with an effectiveness comparable to the more expensive techniques, but at a cost comparable to the faster but less effective ones. The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner. REBL generally encodes more compactly than compression alone (by up to a factor of 14) and than a combination of compression and duplicate suppression (by up to a factor of 6.7). REBL also encodes comparably to a technique based on delta-encoding, reducing overall space significantly in one case. Furthermore, REBL uses super-fingerprints, a technique that reduces the data needed to identify similar blocks while dramatically reducing the computational requirements of matching them: it turns O(n²) comparisons into hash table lookups. As a result, using super-fingerprints to avoid enumerating matching data objects decreases computation in the resemblance detection phase of REBL by up to a couple of orders of magnitude.
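
The super-fingerprint mechanism builds on shingling and min-wise sampling in the style of Broder: each block yields a set of features, disjoint groups of features are hashed into super-fingerprints, and blocks sharing a super-fingerprint are found via a hash table rather than by pairwise comparison. The Python sketch below is a minimal illustration of that matching strategy, not the paper's implementation; the window size, feature count, group size, and the salted SHA-1 standing in for Rabin fingerprints are all illustrative assumptions.

    import hashlib
    import os
    from collections import defaultdict

    # Illustrative parameters (assumptions, not the paper's tuned values).
    WINDOW = 16        # bytes per sliding-window shingle
    NUM_FEATURES = 16  # independent min-wise features kept per block
    GROUP_SIZE = 4     # features coalesced into one super-fingerprint

    def _h(data: bytes, salt: int = 0) -> int:
        # Stand-in for a Rabin fingerprint: salted SHA-1 truncated to 64 bits.
        return int.from_bytes(hashlib.sha1(bytes([salt]) + data).digest()[:8], "big")

    def features(block: bytes) -> list[int]:
        # Min-wise sampling: for each of NUM_FEATURES independent hash
        # functions, keep the minimum value over all shingles in the block.
        shingles = [block[i:i + WINDOW]
                    for i in range(max(1, len(block) - WINDOW + 1))]
        return [min(_h(s, salt) for s in shingles) for salt in range(NUM_FEATURES)]

    def super_fingerprints(block: bytes) -> list[int]:
        # Hash disjoint groups of features together; two blocks sharing even
        # one super-fingerprint are very likely to resemble each other.
        f = features(block)
        return [_h(b"".join(x.to_bytes(8, "big") for x in f[i:i + GROUP_SIZE]))
                for i in range(0, NUM_FEATURES, GROUP_SIZE)]

    def candidate_pairs(blocks: dict[str, bytes]) -> set[tuple[str, str]]:
        # Index blocks by super-fingerprint, turning resemblance detection
        # into hash-table lookups instead of O(n^2) pairwise comparisons.
        index: dict[int, list[str]] = defaultdict(list)
        pairs: set[tuple[str, str]] = set()
        for name, data in blocks.items():
            for sfp in super_fingerprints(data):
                for other in index[sfp]:
                    first, second = sorted((name, other))
                    pairs.add((first, second))
                index[sfp].append(name)
        return pairs

    if __name__ == "__main__":
        base = os.urandom(4096)
        near = bytearray(base)
        near[100] ^= 0xFF  # a one-byte change leaves the block highly similar
        blocks = {"a": bytes(base), "b": bytes(near), "c": os.urandom(4096)}
        print(candidate_pairs(blocks))  # almost certainly {('a', 'b')}

Grouping features this way means a super-fingerprint matches only when all of its constituent features match, so a single shared super-fingerprint is already strong evidence of resemblance, which is what lets matching proceed by hash lookup rather than enumeration.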