DataSeries: an efficient, flexible data format for structured serial data

Authors:
Eric Anderson; Martin Arlitt; Charles B. Morrey, III; Alistair Veitch
Affiliations:
HP Labs, Palo Alto, CA; HP Labs, Palo Alto, CA; HP Labs, Palo Alto, CA; HP Labs, Palo Alto, CA
Venue:
ACM SIGOPS Operating Systems Review
Year:
2009

Abstract

Structured serial data is used in many scientific fields; such data sets consist of a series of records and are typically written once, read many times, chronologically ordered, and read sequentially. In this paper we introduce DataSeries, an on-disk format, run-time library, and set of tools for storing and analyzing structured serial data. We identify six key properties of a system to store and analyze this type of data, and describe how DataSeries was designed to provide these properties. We quantify the benefits of DataSeries through several experiments. In particular, we demonstrate that DataSeries exceeds the performance of common trace formats by at least a factor of two.