RCS—a system for version control
Software—Practice & Experience
Balanced multidimensional extendible hash tree
PODS '86 Proceedings of the fifth ACM SIGACT-SIGMOD symposium on Principles of database systems
The design and implementation of a log-structured file system
ACM Transactions on Computer Systems (TOCS)
LH: Linear Hashing for distributed files
SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
The data compression book (2nd ed.)
The data compression book (2nd ed.)
LH*—a scalable, distributed data structure
ACM Transactions on Database Systems (TODS)
Petal: distributed virtual disks
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Using content-derived names for configuration management
Proceedings of the 1997 symposium on Software reusability
Potential benefits of delta encoding and data compression for HTTP
SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
Efficient distributed backup with delta compression
Proceedings of the fifth workshop on I/O in parallel and distributed systems
Archival storage for digital libraries
Proceedings of the third ACM conference on Digital libraries
Min-wise independent permutations (extended abstract)
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
In-place reconstruction of delta compressed files
PODC '98 Proceedings of the seventeenth annual ACM symposium on Principles of distributed computing
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Finding related pages in the World Wide Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
A prototype implementation of archival Intermemory
Proceedings of the fourth ACM conference on Digital libraries
Deciding when to forget in the Elephant file system
Proceedings of the seventeenth ACM symposium on Operating systems principles
Managing gigabytes (2nd ed.): compressing and indexing documents and images
Managing gigabytes (2nd ed.): compressing and indexing documents and images
ACM Computing Surveys (CSUR)
XMill: an efficient compressor for XML data
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Engineering the compression of massive tables: an experimental approach
SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Min-wise independent permutations
Journal of Computer and System Sciences - 30th annual ACM symposium on theory of computing
OceanStore: an architecture for global-scale persistent storage
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
A low-bandwidth network file system
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Database Design
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Compactly encoding unstructured inputs with differential compression
Journal of the ACM (JACM)
Storage mappings for multidimensional linear dynamic hashing
PODS '83 Proceedings of the 2nd ACM SIGACT-SIGMOD symposium on Principles of database systems
GPFS: A Shared-Disk File System for Large Computing Clusters
FAST '02 Proceedings of the Conference on File and Storage Technologies
Venti: A New Approach to Archival Storage
FAST '02 Proceedings of the Conference on File and Storage Technologies
Data and Metadata Collections for Scientific Applications
HPCN Europe 2001 Proceedings of the 9th International Conference on High-Performance Computing and Networking
Cluster-Based Delta Compression of a Collection of Files
WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
Engineering a Differencing and Compression Data Format
ATEC '02 Proceedings of the General Track of the annual conference on USENIX Annual Technical Conference
A large-scale study of the evolution of web pages
WWW '03 Proceedings of the 12th international conference on World Wide Web
Towards an Archival Intermemory
ADL '98 Proceedings of the Advances in Digital Libraries Conference
High-Availability LH* Schemes with Mirroring
COOPIS '96 Proceedings of the First IFCIS International Conference on Cooperative Information Systems
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Rules of Thumb in Data Engineering
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Long-term unix file system activity and the efficacy of automatic file migration
Long-term unix file system activity and the efficacy of automatic file migration
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Improving table compression with combinatorial optimization
Journal of the ACM (JACM)
Deep Store: An Archival Storage System Architecture
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Farsite: federated, available, and reliable storage for an incompletely trusted environment
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Taming aggressive replication in the Pangaea wide-area file system
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
File size distribution on UNIX systems: then and now
ACM SIGOPS Operating Systems Review
Pangaea: a symbiotic wide-area file system
EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Providing High Reliability in a Minimum Redundancy Archival Storage System
MASCOTS '06 Proceedings of the 14th IEEE International Symposium on Modeling, Analysis, and Simulation
Efficient archival data storage
Efficient archival data storage
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
Redundancy elimination within large collections of files
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
TAPER: tiered approach for eliminating redundancy in replica synchronization
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
An analysis of compare-by-hash
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
File system design for an NFS file server appliance
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
RadixZip: linear time compression of token streams
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Avoiding the disk bottleneck in the data domain deduplication file system
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Sparse indexing: large scale, inline deduplication using sampling and locality
FAST '09 Proccedings of the 7th conference on File and storage technologies
HYDRAstor: a Scalable Secondary Storage
FAST '09 Proccedings of the 7th conference on File and storage technologies
I/O Deduplication: Utilizing content similarity to improve I/O performance
ACM Transactions on Storage (TOS)
HydraFS: a high-throughput file system for the HYDRAstor content-addressable storage system
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Delta compressed and deduplicated storage using stream-informed locality
HotStorage'12 Proceedings of the 4th USENIX conference on Hot Topics in Storage and File Systems
Understanding data survivability in archival storage systems
Proceedings of the 5th Annual International Systems and Storage Conference
WAN-optimized replication of backup datasets using stream-informed delta compression
ACM Transactions on Storage (TOS)
Hi-index | 0.00 |
The ever-increasing volume of archival data that needs to be reliably retained for long periods of time and the decreasing costs of disk storage, memory, and processing have motivated the design of low-cost, high-efficiency disk-based storage systems. However, managed disk storage is still expensive. To further lower the cost, redundancy can be eliminated with the use of interfile and intrafile data compression. However, it is not clear what the optimal strategy for compressing data is, given the diverse collections of data. To create a scalable archival storage system that efficiently stores diverse data, we present PRESIDIO, a framework that selects from different space-reduction efficent storage methods (ESMs) to detect similarity and reduce or eliminate redundancy when storing objects. In addition, the framework uses a virtualized content addressable store (VCAS) that hides from the user the complexity of knowing which space-efficient techniques are used, including chunk-based deduplication or delta compression. Storing and retrieving objects are polymorphic operations independent of their content-based address. A new technique, harmonic super-fingerprinting, is also used for obtaining successively more accurate (but also more costly) measures of similarity to identify the existing objects in a very large data set that are most similar to an incoming new object. The PRESIDIO design, when reported earlier, had comprehensively introduced for the first time the notion of deduplication, which is now being offered as a service in storage systems by major vendors. As an aid to the design of such systems, we evaluate and present various parameters that affect the efficiency of a storage system using empirical data.