Storage optimization for large-scale distributed stream-processing systems

Authors:
Kirsten Hildrum;Fred Douglis;Joel L. Wolf;Philip S. Yu;Lisa Fleischer;Akshay Katta
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;Dartmouth College, Hanover, NH;Amazon Corporation, Seattle, WA
Venue:
ACM Transactions on Storage (TOS)
Year:
2008

References (25):
The placement optimization program: a practical solution to the disk file assignment problem. SIGMETRICS '89: Proceedings of the 1989 ACM SIGMETRICS international conference on Measurement and modeling of computer systems.
A calculus of variations approach to file allocation problems in computer systems. SIGMETRICS '90: Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems.
The art of computer programming, volume 3: sorting and searching (2nd ed.).
File Assignment in Parallel I/O Systems with Minimal Variance of Service Time. IEEE Transactions on Computers.
Comparative Models of the File Assignment Problem. ACM Computing Surveys (CSUR).
Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. SOSP '01: Proceedings of the eighteenth ACM symposium on Operating systems principles.
Wide-area cooperative storage with CFS. SOSP '01: Proceedings of the eighteenth ACM symposium on Operating systems principles.
Minerva: An automated resource provisioning tool for large-scale storage systems. ACM Transactions on Computer Systems (TOCS).
Computer Performance Modeling Handbook.
Introduction to Linear Optimization.
Allocating Data and Operations to Nodes in Distributed Database Design. IEEE Transactions on Knowledge and Data Engineering.
Parameter interdependencies of file placement models in a Unix system. SIGMETRICS '84: Proceedings of the 1984 ACM SIGMETRICS conference on Measurement and modeling of computer systems.
TelegraphCQ: continuous dataflow processing. Proceedings of the 2003 ACM SIGMOD international conference on Management of data.
The 8 requirements of real-time stream processing. ACM SIGMOD Record.
Network-Aware Operator Placement for Stream-Processing Systems. ICDE '06: Proceedings of the 22nd International Conference on Data Engineering.
Position: short object lifetimes require a delete-optimized storage system. Proceedings of the 11th ACM SIGOPS European workshop.
Multi-site cooperative data stream analysis. ACM SIGOPS Operating Systems Review.
Adaptive Control of Extreme-scale Stream Processing Systems. ICDCS '06: Proceedings of the 26th IEEE International Conference on Distributed Computing Systems.
Ursa minor: versatile cluster-based storage. FAST '05: Proceedings of the 4th USENIX Conference on File and Storage Technologies, Volume 4.
Explicit control in a batch-aware distributed file system. NSDI '04: Proceedings of the 1st Symposium on Networked Systems Design and Implementation, Volume 1.
Towards Autonomic Fault Recovery in System-S. ICAC '07: Proceedings of the Fourth International Conference on Autonomic Computing.
Autonomic operations in cooperative stream processing systems. HotAC II: Hot Topics in Autonomic Computing.
File Placement on Distributed Computer Systems. Computer.
Synergy: sharing-aware component composition for distributed stream processing systems. Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware.
Time-varying management of data storage. HotDep '05: Proceedings of the First conference on Hot topics in system dependability.

Cited by (4):
Advances and Challenges for Scalable Provenance in Stream Processing Systems. Provenance and Annotation of Data and Processes.
SODA: an optimizing scheduler for large-scale stream-based distributed computer systems. Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware.
COLA: optimizing stream processing applications via graph partitioning. Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware.
COLA: optimizing stream processing applications via graph partitioning. Middleware '09: Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware.

Abstract

We consider storage in an extremely large-scale distributed computer system designed for stream processing applications. In such systems, both incoming data and intermediate results may need to be stored to enable analyses at unknown future times. The quantity of data of potential use would dominate even the largest storage system. Thus, a mechanism is needed to keep the data most likely to be used. One recently introduced approach is to employ retention value functions, which effectively assign each data object a value that changes over time in a prespecified way [Douglis et al. 2004]. Storage space for data entering the system is reclaimed automatically by deleting data of the lowest current value. In such large systems, there will naturally be multiple file systems available, each with different properties. Choosing the right file system for a given incoming stream of data presents a challenge. In this article we provide a novel and effective scheme for optimizing the placement of data within a distributed storage subsystem employing retention value functions. The goal is to keep the data of highest overall value, while simultaneously balancing the read load to the file system. The key aspects of such a scheme are quite different from those that arise in traditional file assignment problems. We further motivate this optimization problem and describe a solution, comparing its performance to other reasonable schemes via simulation experiments.
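
The reclamation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear decay shape, the names `StoredObject`, `linear_decay`, and `reclaim`, and the single-file-system setting are all assumptions made for illustration, and the sketch does not attempt the placement optimization across multiple file systems.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StoredObject:
    """Hypothetical record for one stored data object."""
    name: str
    size: int                            # storage units occupied
    created: float                       # arrival time
    retention: Callable[[float], float]  # maps object age to current value

    def value_at(self, now: float) -> float:
        return self.retention(now - self.created)


def linear_decay(initial: float, lifetime: float) -> Callable[[float], float]:
    """Assumed example shape: value falls linearly to zero over `lifetime`."""
    def value(age: float) -> float:
        return max(0.0, initial * (1.0 - age / lifetime))
    return value


def reclaim(store: List[StoredObject], capacity: int, now: float) -> List[StoredObject]:
    """Delete objects of lowest current value until the rest fit in `capacity`."""
    survivors = sorted(store, key=lambda o: o.value_at(now))  # lowest value first
    used = sum(o.size for o in survivors)
    while survivors and used > capacity:
        victim = survivors.pop(0)
        used -= victim.size
    return survivors


store = [
    StoredObject("a", 4, created=0.0, retention=linear_decay(10.0, 10.0)),
    StoredObject("b", 4, created=5.0, retention=linear_decay(10.0, 10.0)),
    StoredObject("c", 4, created=8.0, retention=linear_decay(5.0, 10.0)),
]
kept = reclaim(store, capacity=8, now=9.0)
kept_names = {o.name for o in kept}
```

In this example the oldest object, `a`, has decayed to the lowest current value at time 9 and is the one deleted to bring usage within capacity.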