Accelerating large-scale data exploration through data diffusion

Authors:
Ioan Raicu;Yong Zhao;Ian T. Foster;Alex Szalay
Affiliations:
University of Chicago, Chicago, IL, USA;Microsoft Corporation, Redmond, WA, USA;University of Chicago, Chicago, IL and Argonne National Laboratory, Argonne, IL, USA;The Johns Hopkins University, Baltimore, MD, USA
Venue:
DADC '08 Proceedings of the 2008 international workshop on Data-aware distributed computing
Year:
2008

Abstract

Data-intensive applications often require exploratory analysis of large datasets. If analysis is performed on distributed resources, data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. Depending on workload and resource characteristics, this approach can provide the benefits of dedicated hardware without the associated high costs. The approach is reminiscent of cooperative caching, web caching, and peer-to-peer storage systems, but addresses different application demands. Other data-aware scheduling approaches assume dedicated resources, which can be expensive and/or inefficient if load varies significantly. To explore the feasibility of the data diffusion approach, we have extended the Falkon resource provisioning and task scheduling system to support data caching and data-aware scheduling. Performance results from both micro-benchmarks and a large-scale astronomy application demonstrate that our approach improves performance relative to alternative approaches and also improves scalability, because aggregate I/O bandwidth scales linearly with the number of data cache nodes.
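
The three mechanisms named in the abstract (dynamic resource acquisition and release, demand-driven replication into node-local caches, and scheduling computations close to data) can be illustrated with a toy model. The following is a minimal sketch under stated assumptions, not Falkon's implementation: the names CacheNode, schedule, and resize, the LRU eviction policy, and the tasks_per_node sizing rule are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class CacheNode:
    """Hypothetical compute node with a small LRU cache of file names."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity      # max number of cached files (assumed)
        self.cache = OrderedDict()    # file name -> True, oldest entry first
        self.tasks_run = 0

    def run(self, file_name):
        """Run one task on file_name; return True on a local cache hit."""
        self.tasks_run += 1
        if file_name in self.cache:
            self.cache.move_to_end(file_name)  # mark as most recently used
            return True
        # Miss: the file would be fetched from persistent storage here,
        # and a replica is kept locally for later requests.
        self.cache[file_name] = True
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return False


def resize(nodes, queued_tasks, tasks_per_node, capacity):
    """Grow or shrink the node pool to match demand (assumed sizing rule)."""
    wanted = max(1, -(-queued_tasks // tasks_per_node))  # ceiling division
    while len(nodes) < wanted:
        nodes.append(CacheNode(f"node{len(nodes)}", capacity))
    while len(nodes) > wanted:
        nodes.pop()  # released node; its cached replicas are lost


def schedule(nodes, file_name):
    """Prefer nodes that already cache the file; break ties by load."""
    holders = [n for n in nodes if file_name in n.cache]
    candidates = holders or nodes
    node = min(candidates, key=lambda n: n.tasks_run)
    hit = node.run(file_name)
    return node.name, hit


if __name__ == "__main__":
    pool = []
    resize(pool, queued_tasks=4, tasks_per_node=2, capacity=2)
    for f in ["a", "b", "a", "b"]:
        print(f, schedule(pool, f))
    resize(pool, queued_tasks=0, tasks_per_node=2, capacity=2)
    print("nodes after demand drops:", len(pool))
```

In this sketch, repeated requests for the same file land on the node that already holds a replica, so later requests avoid the fetch from persistent storage. A real system would also have to weigh cache affinity against load imbalance, which this toy tie-breaking rule does not attempt.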