This paper evaluates the potential gains a workflow-aware storage system can bring. Two observations lead us to believe such a storage system is crucial to efficiently supporting workflow-based applications. First, workflows generate irregular, application-dependent data access patterns. These patterns leave existing storage systems unable to harness all optimization opportunities, because doing so often requires conflicting optimization options or even conflicting design decisions at the level of the storage system. Second, workflow runtime engines make suboptimal scheduling decisions because they lack detailed data location information. This paper discusses the feasibility of building a workflow-aware storage system that supports per-file access optimizations and exposes data location, and evaluates the potential performance benefits of doing so. To this end, it presents approaches for determining application-specific data access patterns and experimentally evaluates the performance gains of a workflow-aware storage approach. Our evaluation using synthetic benchmarks shows that a workflow-aware storage system can bring significant performance gains: up to 7x compared to the MosaStore distributed storage system and up to 16x compared to a central, well-provisioned NFS server.