Existing data grid scheduling systems handle large-scale data I/O through replica location services coupled with simple staging, and do so independently of the scheduling of compute tasks. As an application or workflow scales, however, we observe considerable performance degradation compared with processing within a tightly coupled cluster. For example, when numerous nodes access the same set of files simultaneously, performance degrades severely even when replicas are used, because of bottlenecks in the infrastructure. Instead of resorting to expensive solutions such as parallel file systems, we propose alleviating the problem by tightly coupling replica and data transfer management with computation scheduling. In particular, we propose three techniques: (1) dynamic aggregation and O(1) replication of data-staging requests across multiple nodes using a multi-replication framework; (2) replica-centric scheduling, which uses data reuse and time-to-replication as compute scheduling metrics on the grid; and (3) overlapped execution of data staging and compute-bound tasks. Early benchmark results from our prototype Condor-like grid scheduling system show that these techniques eliminate much of the data transfer overhead in many cases.
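The first two techniques can be illustrated with a minimal sketch. This is not the prototype's implementation: the names `Node`, `aggregate_requests`, `time_to_replication`, and `pick_node`, the per-node bandwidth field, and the simple size-over-bandwidth cost model are all assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical compute node; fields are illustrative assumptions."""
    name: str
    cached: set = field(default_factory=set)  # files already replicated here
    bandwidth_mb_s: float = 100.0             # assumed transfer rate to this node


def aggregate_requests(requests):
    """Group (node_name, file_name) staging requests by file.

    Each file then maps to the list of nodes that want it, so it can be
    handed to a multi-destination replication step once instead of being
    fetched separately by every node.
    """
    by_file = defaultdict(set)
    for node_name, file_name in requests:
        by_file[file_name].add(node_name)
    return {f: sorted(nodes) for f, nodes in by_file.items()}


def time_to_replication(node, files, sizes_mb):
    """Estimate seconds needed to stage the files the node does not yet hold."""
    missing_mb = sum(sizes_mb[f] for f in files if f not in node.cached)
    return missing_mb / node.bandwidth_mb_s


def pick_node(nodes, files, sizes_mb):
    """Choose the node with the lowest estimated time-to-replication.

    Data reuse is rewarded implicitly: cached files cost nothing to stage.
    Ties are broken by node name to keep the choice deterministic.
    """
    return min(nodes, key=lambda n: (time_to_replication(n, files, sizes_mb), n.name))


requests = [("n1", "a.dat"), ("n2", "a.dat"), ("n1", "b.dat")]
grouped = aggregate_requests(requests)

nodes = [Node("n1", cached={"a.dat"}), Node("n2")]
sizes_mb = {"a.dat": 1000.0, "b.dat": 200.0}
chosen = pick_node(nodes, ["a.dat", "b.dat"], sizes_mb)
```

In this sketch, the two requests for `a.dat` collapse into a single entry with two destinations, and the job needing both files is placed on `n1`, which already holds the larger file. The third technique, overlapping staging with compute-bound tasks, is not modeled here.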