Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories

Authors:
Ian F. Adams;Mark W. Storer;Ethan L. Miller
Affiliations:
University of California, Santa Cruz;NetApp;University of California, Santa Cruz
Venue:
ACM Transactions on Storage (TOS)
Year:
2012

Abstract

The scope of archival systems is expanding beyond cheap tertiary storage: scientific and medical data is increasingly digital, and the public has a growing desire to digitally record their personal histories. Driven by the increase in cost efficiency of hard drives and the rise of the Internet, content archives have become a means of providing the public with fast, cheap access to long-term data. Unfortunately, designers of purpose-built archival systems are forced either to rely on workload behavior obtained from a narrow, anachronistic view of archives as simply cheap tertiary storage, or to extrapolate from marginally related enterprise workload data and traditional library access patterns. To close this knowledge gap and provide relevant input for the design of effective long-term data storage systems, we studied the workload behavior of several systems within this expanded archival storage space. Our study examined several scientific and historical archives, covering a mixture of purposes, media types, and access models (that is, public versus private). Our findings show that, for more traditional private scientific archival storage, files have become larger, but update rates have remained largely unchanged. However, in the public content archives we observed, we saw behavior that diverges from the traditional “write-once, read-maybe” behavior of tertiary storage. Our study shows that the majority of such data is modified relatively frequently, sometimes unnecessarily, and that indexing services such as Google and internal data management processes may routinely access large portions of an archive, accounting for most of the accesses. Based on these observations, we identify areas for improving the efficiency and performance of archival storage systems.